# Python and Jupyter Notebooks

Welcome to Python! Python is a high-level, object-oriented, interpreted language that focuses on being user-friendly at some cost to raw performance. At the same time, Python has great support for using different libraries, and many of these libraries _are_ capable of achieving good performance. In this course, we're going to learn how to use Python and its libraries to perform data analysis on large datasets, and present the results in an intuitive and meaningful way. Until we plot data, they're just floating point numbers sitting on a disk, so we want to make sense out of them.

To help us do this, we're going to make use of Jupyter Notebooks (like the one you're reading now). Jupyter is a common framework for several different languages ("Jupyter" was originally an acronym-ish of "Julia, Python, and R") and an excellent way to combine running code, description, and plots all into one package. Jupyter lets you write Python code and execute it in an interactive fashion, and share your results in an easy and straightforward way. Jupyter notebooks are an industry-standard way of exploring and sharing data in the Data Science community, and are flexible and easily adapted. But enough talk, let's dig in!

## Jupyter Notebook Basics

Jupyter notebooks have two kinds of cells: "code" and "markdown". "Code" cells are live code blocks that can be executed by hitting the "Run" button in the bar, or by hitting "Shift + Enter" on the keyboard. "Markdown" cells contain text input, which can be formatted as text, bulleted lists, code, and even LaTeX for typeset mathematics. We'll make use of all of these types in the examples below.

## Executing Code

The code cells in Jupyter notebooks can be executed and print output directly to the notebook. This output can be text or even figures. To execute code, the Jupyter notebook needs to spawn a _kernel_, which is a background task that handles all of the actual execution. You can still view Jupyter notebooks even without a kernel, which makes them useful for sharing data and results. But to do any live execution, a kernel is required. This kernel can either run locally, on your own machine, or on a remote computing cluster, where the Jupyter notebook is running in your browser window but executing code elsewhere. This approach makes notebooks very flexible and powerful.

# Python Basics

Python is an _interpreted_ language, which means that it does not need to be compiled before it can be run. As a trade-off, this makes it slower than a compiled language, but we can make use of high-performance libraries to recoup some of these losses for performance-critical code. But to start out, we'll begin with some of the basics of Python.

Python is a _dynamically typed_ language, which means that variable types are determined at runtime. An implication of this is that variable types do not need to be declared before they are used. The interpreter will determine the type of objects as it encounters them and adjust accordingly. Python is also a _strongly typed_ language, which means that once variable types are determined, there are well-defined rules for how operations between them behave. In particular, Python does relatively little type coercion to get disparate types to work for a given operation. Instead, the programmer can explicitly _cast_ objects to different types if needed. This involves calling a function, like `int`, on a value or variable to convert its type.
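
To make the "strongly typed" point concrete, here is a minimal sketch (the variable names are only for illustration). Adding a string to an integer raises a `TypeError` rather than silently converting one of them, while an explicit cast works. The `try`/`except` wrapper is there only so the cell keeps running after the error:

```python
text = "3"      # a string that happens to contain a digit
number = 4      # an integer

# Python will not coerce the string for us; this raises a TypeError
try:
    result = text + number
except TypeError as err:
    print("TypeError:", err)

# Casting explicitly tells Python which operation we mean
as_int = int(text) + number      # integer addition
as_str = text + str(number)      # string concatenation
print(as_int, as_str)
```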

Let's start off with the classic "Hello world!" example. In Python, this looks like this:

In [None]:
print("Hello, world!")

In the above code, we used the `print` function, which takes a string as an argument and prints its contents to standard output. String data in Python is denoted using quotes. Single quotes (`'`) and double quotes (`"`) are equivalent.

Now let's do some basic math:

In [None]:
1 + 1 

We have added together two integers. So far, so good! What about floating point numbers?

In [None]:
1 + 1.5

Python was smart enough to figure out that I needed to use floating point arithmetic to avoid losing precision, and automatically promoted my integer data accordingly. It will keep the data as floating point, even if the result is a whole number, and format my output accordingly. If I want an integer back, I can cast explicitly with `int`:

In [None]:
1.5 + 1.5

In [None]:
int(3.0)

We can also do multiplication and division:

In [None]:
2 * 2

In [None]:
1.5 * 2

In [None]:
3 / 2

In this last example, note that I put in two integers, but got a floating point number as output. This is a change from Python 2 to Python 3! I can get integer division, where the result is rounded down, by using the `//` operator:

In [None]:
3 // 2
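
One detail worth knowing: `//` rounds toward negative infinity (it is "floor" division) rather than truncating toward zero, which only makes a difference for negative operands. A quick sketch, which also shows the companion remainder operator `%`:

```python
q_pos = 7 // 2     # 3
q_neg = -7 // 2    # -4, not -3: the result is rounded down
r_pos = 7 % 2      # 1
r_neg = -7 % 2     # 1, because -7 == 2 * (-4) + 1
print(q_pos, q_neg, r_pos, r_neg)
```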

We can also do exponentiation using the `**` operator:

In [None]:
2**3

Python is also capable of performing complex arithmetic, where we use the `j` suffix to denote the imaginary unit:

In [None]:
(1 + 1j) * (2 - 3j)

# Variable Assignment

We can _assign_ objects to variable names by using the assignment operator `=`. This allows us to reference them later. We do not need to tell Python the type of a variable ahead of time: the interpreter will figure it out for us!

In [None]:
a = 1
print(a)

In [None]:
a = 1
b = 2
a + b

Note that statements are evaluated immediately. In the cell below, `c` stores the value of `a + b` at the time of assignment, and does not change when `b` is reassigned later:

In [None]:
a = 1
b = 2
c = a + b
b = 10
print("a, b, c: ", a, b, c)

# Python Data Types

The above operations were performed on primitive data types, like integers and floating point numbers. We can find out the type of an object using the `type` function:

In [None]:
type(1)

In [None]:
type(1.0)

In [None]:
type("Hello, world!")

In [None]:
type(True)

In [None]:
type((1 + 1j))

## Lists

We can also have collections of objects. The most basic collection in Python is the `list`. We denote a list using square brackets `[ ]`. Lists can contain objects of different types, and preserve the type of individual elements:

In [None]:
mylist = [1, 1.0, "Hello, world!", True]
print(mylist)

We can access individual elements by _indexing_ the list, using square brackets following the variable name without a space. Lists are 0-indexed in Python. For example, to get the first ("zero-th") element of `mylist`, I can do:

In [None]:
mylist[0]

We can get multiple elements from the list using _slice_ notation. In general, we write a slice `a:b:i`, where `a` is the starting index, `b` is the ending index, and `i` is the interval. Note that `b` is exclusive, meaning that Python will select up to _but not including_ index `b`. Note that it is not necessary to specify all values `a`, `b`, and `i` when writing a slice. If not included, `a` defaults to 0, `b` defaults to `N` (the length of the list), and `i` defaults to 1. Some examples:

In [None]:
mylist2 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [None]:
print(mylist2[0:3])

In [None]:
print(mylist2[1:4])

In [None]:
print(mylist2[0:6:2])

In [None]:
print(mylist2[:])

In [None]:
print(mylist2[::3])

In [None]:
print(mylist2[0:-1])

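Note that in this last example, I have used a negative number as an index. Negative indices count backward from the end of the list, so `-1` refers to the last element, and the slice `0:-1` selects everything except the last element.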
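
A few more variations on negative indices, as a small sketch (the list `letters` is made up for this example). A negative interval also works, and walks the list in reverse; in that case the default start and end are swapped accordingly:

```python
letters = ["a", "b", "c", "d", "e"]

last = letters[-1]            # the final element
last_two = letters[-2:]       # from the second-to-last element to the end
all_but_last = letters[:-1]   # everything except the final element
backward = letters[::-1]      # a negative interval walks the list in reverse
print(last, last_two, all_but_last, backward)
```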

If we try to fetch an element at an index that is out of bounds, Python will notice and complain:

In [None]:
print(mylist2[15])

Lists have several useful _functions_ and _methods_ associated with them. Functions are invoked using parentheses `( )` after the name of the function, and can take zero or more arguments. Methods are functions that are defined on objects, and are invoked using the `.` syntax following the variable name. For instance, the `len` function will tell us the length of a list:

In [None]:
len(mylist2)

We can add an element to the end of a list using the `append` method, where we pass as an argument the object we'd like to add:

In [None]:
print(mylist2)
mylist2.append(10)
print(mylist2)

We can remove an element at a given index using the `del` statement. The `pop` method also removes the element at an index, but returns it so that we can keep using it:

In [None]:
del mylist2[3]
print(mylist2)

In [None]:
myval = mylist2.pop(1)
print(mylist2)
print(myval)

I can also "add" two lists together, where I combine them into a single list. We can use the `+` operator to do this:

In [None]:
list1 = [0, 1, 2]
list2 = [3, 4, 5]
big_list = list1 + list2
print(big_list)

What happens if we try to subtract?

In [None]:
small_list = list1 - list2

## Tuples

Tuples are like lists, in that they are a collection of objects, but they are _immutable_ once they have been defined. Although we can index tuples like we can for lists, we cannot change them once we have made them. We denote tuples by using parentheses `( )`.

In [None]:
mytuple = (1, 2.2, "hello", None)
print(mytuple[1])

Note that the `len` function still works:

In [None]:
len(mytuple)

But the `.append` method is not defined:

In [None]:
mytuple.append(4)

However, we _can_ add tuples together to get a new tuple:

In [None]:
tuple1 = (0, 1)
tuple2 = ("hello", None)
tuple3 = tuple1 + tuple2
print(tuple3)
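
To round out the picture of immutability, here is a short sketch (with a made-up tuple `point`). Assigning to an index fails just like `.append` does, and "adding" builds a brand-new tuple rather than modifying the original. The `try`/`except` is only there so the cell keeps running after the error:

```python
point = (1, 2, 3)

# Assigning to an index is not allowed on a tuple
try:
    point[0] = 10
except TypeError as err:
    print("TypeError:", err)

# "Adding" builds a new tuple; the original is untouched
longer = point + (4,)   # note the trailing comma for a one-element tuple
print(point, longer)
```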

## Dictionaries

Dictionaries are one of the most common collections in Python. A dictionary consists of key-value pairs, where we use the key as an index to retrieve the associated value. The keys for dictionaries must be _immutable_, but the values can be anything (even other dictionaries!). We denote a dictionary using curly braces `{ }`, and `key: value` pairs. For example:

In [None]:
mydict = {"key1": "hello", "key2": "there"}

We index a dictionary by using `dictionary_name[key]` syntax. For example:

In [None]:
print(mydict["key1"])
print(mydict["key2"])

If we try to use a key that does not exist in the dictionary, Python will complain:

In [None]:
print(mydict["key3"])

If we assign a value to a dictionary key that already exists, Python will update the value in the dictionary:

In [None]:
mydict["key2"] = "world"
print(mydict["key2"])

This means that to store multiple values in a dictionary, we must use unique keys.

However, note that the `type` of the key matters! The key `1` is different from the key `"1"`:

In [None]:
mydict = {"1": "the string 1", 1: "the int 1"}
print(mydict[1])
print(mydict["1"])

In [None]:
mydict = {"mylist": [0, 1, "hi", True]}
print(mydict["mylist"])

In [None]:
print(mydict.keys())

In [None]:
mydict["mytuple"] = (1, 2)
print(mydict.keys())
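
The requirement that keys be immutable can be seen directly. In this sketch (the dictionary `grid` is hypothetical), a tuple is accepted as a key while a list is rejected. It also shows two common ways to avoid the "missing key" error from earlier: testing membership with `in`, and the `.get` method, which returns a default value instead of raising:

```python
grid = {}
grid[(0, 1)] = "a tuple works as a key"    # tuples are immutable

try:
    grid[[0, 1]] = "a list does not"       # lists are mutable
except TypeError as err:
    print("TypeError:", err)

# Checking for a key before indexing avoids a KeyError
has_key = (0, 1) in grid
missing = grid.get((5, 5), "no such key")  # default returned instead of an error
print(has_key, missing)
```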

# For Loops

Like most languages, Python supports iterating through a block of code for a certain number of iterations. For example:

In [None]:
for i in range(10):
    print("body of loop: ", i)
    
print("outside of loop: ", i)

Unlike other languages, Python does not use braces to define the body of the loop. Instead, white space is meaningful! The body of the loop is indented with respect to the definition (typically 4 spaces), and the loop is ended when the code has been dedented. Python uses a similar convention when defining functions, so this will become very familiar.

We can also iterate through objects in collections. We use a similar syntax:

In [None]:
for element in mylist:
    print(element)

We can also use the handy `enumerate` function if we want the index of the elements, as well as the elements themselves:

In [None]:
for i, element in enumerate(mylist):
    if i > 2:
        break
    print("index: ", i)
    print("element: ", element)

We can also build `while` loops that will continue to iterate _while_ a particular condition is true, which has a similar construction to `for` loops. Also good to know: the `continue` statement will skip the rest of the loop and move on to the next iteration, and `break` will stop executing the loop entirely.
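
As a minimal sketch of the statements just described (variable names are only for illustration), the first loop below runs while its condition holds, and the second uses `continue` and `break` to collect the odd numbers below 10:

```python
count = 0
while count < 3:             # keep going while the condition is true
    print("count is", count)
    count = count + 1

n = 0
odd_numbers = []
while True:                  # loop until we explicitly break out
    n = n + 1
    if n % 2 == 0:
        continue             # skip even numbers, move on to the next iteration
    if n > 9:
        break                # stop executing the loop entirely
    odd_numbers.append(n)

print(odd_numbers)
```

Note that a `while` loop whose condition never becomes false (and that never reaches a `break`) will run forever, so it is worth double-checking that something in the body changes the condition.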

# If Statements

One of the essential elements of control flow, `if` statements will execute a block of code only if the conditional statement is true. Like `for` loops, the code is indented to indicate the body of the `if` statement. We can also have an `else` clause, which will execute if the `if` statement is not true. An example:

In [None]:
myint = 4

if myint % 2 == 0:
    print("even")
else:
    print("odd")

We can have branching `if` statements as well. We indicate this structure using the `elif` keyword:

In [None]:
myint = -3

if myint > 0:
    print("positive")
elif myint < 0:
    print("negative")
else:
    print("zero")

Be careful of type handling! When comparing objects with different types, the comparison operators will only coerce types if the operation is defined for these objects, as it is between integers and floating point numbers. Strings and numbers, by contrast, are treated as not being equal, even if they would be after casting:

In [None]:
if 1 == 1.0:
    print("equal")

In [None]:
if 1 == 1.00001:
    print("equal")

In [None]:
myint = "3"

if myint == 3:
    print("three")
else:
    print("not three")
    
if int(myint) == 3:
    print("three")
else:
    print("not three")

# Importing Libraries

Libraries, or modules, are collections of objects, functions, and methods that extend the functionality of the base Python language. Some of these libraries ship with the Python runtime itself. These compose the "standard library" of Python, and are always available. Some examples are the `math` library, which allows for access to more advanced mathematical functions, the `os` library, which interacts with the operating system, and the `json` library, which can read and write JSON data. There are many other libraries included in the standard library, and there is plenty of documentation online about what they are and what functions they contain.

To use a library, we use the `import` statement. This defines a new namespace in our runtime, which allows us to access these functions in our own code. For example, let's import the math library and print the value of $\pi$:

In [None]:
import math
print(math.pi)

We can also use trigonometric functions:

In [None]:
print(math.sin(math.pi / 4))

We can calculate square roots:

In [None]:
1 / math.sqrt(2)

We can even do more complicated math:

In [None]:
math.gamma(-3.2)
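
The other standard-library modules mentioned above are imported the same way. As a brief sketch (the dictionary `record` is made-up data), `json` can convert a dictionary to a JSON string and back, and `os` can report the directory the kernel is running in:

```python
import json
import os

record = {"name": "example", "values": [1, 2, 3]}   # hypothetical data

text = json.dumps(record)      # dictionary -> JSON string
restored = json.loads(text)    # JSON string -> dictionary
print(text)
print(restored == record)

print(os.getcwd())             # the current working directory
```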

# Beyond the Standard (Python) Model

There are a lot (a LOT) of Python libraries that do lots of useful things. Here we're going to talk about just two of them: `numpy` and `matplotlib`. Neither ships with Python itself, but `numpy` is the community's standard package for doing numerical calculations in Python, and `matplotlib` is great for making plots. Let's import them:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

Here we have used _aliases_ to import these packages. This means that to access numpy functions, we'll use `np.` instead of `numpy.`. These aliases are fairly standard in the community, and it makes typing them a lot easier.

The fundamental datatype of numpy is the $N$-dimensional array, or `ndarray` for short. Unlike Python lists, the elements of arrays must all have the same type. The benefit is that this makes them very memory efficient, and operations on them are performed very quickly.

We can cast a Python list to a numpy array using the `np.array` function:

In [None]:
mylist = [0, 1, 2, 3]
myarray = np.array(mylist, dtype=np.int32)
print(myarray)

Numpy arrays are self-describing by using methods and attributes, and have a datatype (`dtype`) associated with them:

In [None]:
print(myarray.dtype)

They also have attributes indicating their shape (`.shape`), and the total number of elements they contain (`.size`):

In [None]:
print(myarray.shape)
print(myarray.size)

We can also index individual elements, like we do with lists. We use the same list indexing and slicing syntax:

In [None]:
print(myarray[:2])
print(myarray[1:3])
print(myarray[::2])

Instead of casting Python lists to arrays (as we did above), we can also create `empty` arrays, which are allocated but not initialized. We could also ask for arrays of all `zeros` or `ones`. To do this, we specify the shape of the array, and optionally a datatype:

In [None]:
empty_array = np.empty((2, 4))
zero_array = np.zeros((2, 4), dtype=np.float32)
one_array = np.ones((2, 4), dtype=np.int64)

print(empty_array)
print(zero_array)
print(one_array)

Note that `numpy` uses row-major (also sometimes called "C-style") ordering by default. This means that for an array that has $N$ rows and $M$ columns, we specify the shape with a tuple `(N, M)`, and index an element as `array[row, column]`.

Another handy operation for initializing arrays is to use the `linspace` function. This creates an array where elements are linearly spaced between a starting and ending value, where we specify the number of points to use. For example, to specify we want 1,000 points on the interval between 0 and $2\pi$, we can write:

In [None]:
xvals = np.linspace(0, 2 * np.pi, num=1000)
print(xvals)

Like the `math` module, `numpy` knows how to do math operations. We can do things like add or multiply two arrays, as long as their shapes are compatible:

In [None]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(3 + a)
print(b * 2)
print(a + b)
print(a * b)

We can provide arrays as arguments, and `numpy` will _broadcast_ the operation to every element in the array. For those familiar with MATLAB, this is like using the `.` prefix on an operator (as in `.*`). Unlike MATLAB, element-wise operations are the default in `numpy`, so multiplying two arrays will default to multiplying the elements, as shown above (instead of matrix multiplication).
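
To illustrate what "compatible shapes" means, here is a small sketch (array names are only for illustration). A column of shape `(3, 1)` and a row of shape `(3,)` are broadcast against each other to produce a `(3, 3)` result, while shapes that cannot be lined up raise a `ValueError`. It also shows the `@` operator, which performs matrix multiplication (here, a dot product) instead of element-wise multiplication:

```python
import numpy as np

column = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3])              # shape (3,)

grid = column + row    # broadcast to shape (3, 3)
print(grid.shape)
print(grid)

# Shapes that cannot be lined up raise a ValueError
try:
    np.array([1, 2, 3]) + np.array([1, 2])
except ValueError as err:
    print("ValueError:", err)

# Matrix multiplication uses the @ operator instead of *
dot = row @ row        # 1*1 + 2*2 + 3*3
print(dot)
```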

We can also apply mathematical operations to arrays:

In [None]:
yvals = np.sin(xvals)
print(yvals)

If we want to visualize things, we can use `matplotlib` to plot different things. The simplest function is `plot`, which takes $x$-values and $y$-values:

In [None]:
%matplotlib inline

plt.plot(xvals, yvals)

Ta-da! Easy as that.

The `%matplotlib inline` line above is known as "notebook magic", and tells the kernel how to handle things like images. It's actually usually called at the top of a notebook, but I put it here to avoid confusion.

There are actually two different ways to interact with `matplotlib` figures. One is the above way, where we used `plt.plot`. Under the hood, `matplotlib` instantiated objects as needed to make the plot happen. An alternative way (which generally gives more precise control, but is more time-consuming) is to create `figure` and `axes` objects ourselves, and call the plotting functions on those. For instance:

In [None]:
fig = plt.figure()
ax = plt.gca()
ax.plot(xvals, 2 * yvals)

Managing figures and axes ourselves is also good practice, to help ensure that things don't get mixed up in a notebook environment.

It's generally good practice to add labels to our axes, so the reader knows what we're plotting. Here there isn't too much in particular, but let's get in the habit anyway. We can add some labels by calling the appropriate methods on the `Axes` objects:

In [None]:
fig = plt.figure()
ax = plt.gca()
ax.plot(xvals, yvals)
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")

We can also make our labels fancy using LaTeX. To do this, we put the parts of our labels we want to be typeset in between `$` characters. It often also requires passing a "raw string" to Python, where we preface a string with the `r` character (and no space):

In [None]:
fig = plt.figure()
ax = plt.gca()
ax.plot(xvals, xvals**2)
ax.set_xlabel(r"$x$")
ax.set_ylabel(r"$x^2$")