# Base Python principles

In the Introduction to Python class, we learned about using the Pandas and Matplotlib libraries, but didn't learn much about the syntax of Python itself. In particular, all of the notebooks we ran in Introduction to Python were _linear_: Python executed the lines of code in order from top to bottom, without repeating or skipping any. In larger projects, it is often desirable to have _non-linear control flow_, where the lines are not executed in order: some may be repeated, some may be executed in a place other than where they were written, and some may be skipped altogether.

At first, this would seem to make writing code much more complicated, and in some ways it does. But it also makes code simpler in many cases. For instance, you might need to apply the same calculation many times to different columns or different values; using non-linear control flow allows you to do this without duplicating code. You might also want a notebook to work on multiple data files (for instance, one notebook that produces quarterly reports should be able to handle data from any quarter, just by changing the files it reads). This might require special cases, with code that should only be executed in certain circumstances. Both repetition and special-casing let you follow the common programming principle "Don't Repeat Yourself": don't have the same or very similar code repeated throughout the project. The main reason for this is maintenance. Suppose you use some aggregation throughout your analysis and decide to change it. If it's coded in a bunch of different places, you have to find and fix them all, which could lead to errors or omissions. If that code is written once, you only have to fix it once.

We are not importing pandas here, since we will be learning about the built-in functionality of Python. We only import a few standard-library modules that come with Python.

In [None]:
import datetime
import random

## Conditionals

The first tool for non-linear control flow we will be talking about is the conditional, or `if` statement. This statement executes the code in its body only if the condition specified in the statement is true. For instance, the `print` line in the following code cell will run only if the seconds component of the current time is less than 30. Run it several times, until the output changes.

In this cell, we also see the first use of an _indented block_. Notice that the `print` line is indented from the left margin. This signals to Python that this line is part of the body of the if statement - i.e. it should be executed based on whether the current second is less than 30. Any number of lines below the if statement may be indented, and blank lines and comments may be included. The body of the if statement ends at the first non-blank, non-comment line that is not indented.

If you're familiar with other programming languages, you've likely seen indentation used like this before. In most programming languages, this is a stylistic choice. It makes the code easier to read, but does not affect what it does. However, in Python, the indentation actually affects what the code does, because it determines which lines of code make up the body of a particular section of the code.

In [None]:
current_time = datetime.datetime.today()

if current_time.second < 30:
    print("Seconds are less than 30")
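To make the role of indentation concrete, here is a minimal sketch that uses a fixed number in place of the clock, so the result is the same on every run. The names `second` and `body_ran` are made up for this illustration.

```python
second = 45       # fixed stand-in for current_time.second
body_ran = False

if second < 30:
    # Both indented lines form the body of the if statement,
    # so both are skipped when the condition is false.
    print("Seconds are less than 30")
    body_ran = True

# This line is not indented, so it runs regardless of the condition.
print("This line always runs")
print(body_ran)
```

Changing `second` to a value below 30 makes both indented lines run.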

The `if` statement can also be optionally paired with an `else` statement. When there is an `if` paired with an `else`, the body of one of those statements will always be executed - either the `if` if the condition is true, or the `else` otherwise.

In [None]:
current_time = datetime.datetime.today()

if current_time.second < 30:
    print("Seconds are less than 30")
else:
    print("Seconds are greater than or equal to 30")

Finally, there is also an `elif` ("else if") statement, which combines the two when you want to have more than two possible outcomes:

In [None]:
current_time = datetime.datetime.today()

if current_time.second < 20:
    print("Seconds are less than 20")
elif current_time.second < 40:
    print("Seconds are between 20 and 40")
else:
    print("Seconds are greater than or equal to 40")
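One detail worth checking is what happens when more than one condition is true. In an `if`/`elif`/`else` chain, Python tests the conditions from top to bottom and runs only the first branch that matches. The sketch below uses a fixed value (the names `second` and `label` are illustrative) so the outcome is predictable.

```python
second = 10  # satisfies both "< 20" and "< 40"

if second < 20:
    label = "less than 20"
elif second < 40:
    # Skipped: an earlier branch already matched.
    label = "between 20 and 40"
else:
    label = "greater than or equal to 40"

print(label)  # less than 20
```

This is why the `elif` branch does not need to check `second >= 20` explicitly: it is only reached when the first condition was false.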

### The conditions

Any condition that evaluates to `True` or `False` may be used by an `if` statement. There are lots of ways to do this, but the most common way is with comparison operators:

- `a < b`: True if a is less than b. a and b can be variables or literal numbers typed out in your code
- `a <= b`: True if a is less than or equal to b.
- `a > b`, `a >= b`: a is greater than/greater than or equal to b.
- `a == b`: a equals b (note: this is a double equal sign. A single equal sign is for assignment, a double for comparison)
- `a != b`: a does not equal b
- `a in b`: b contains a. Usually b will be a list or array of some sort, which we'll discuss briefly.
- `a % b == 0`: a is divisible by b
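Each of these expressions produces `True` or `False`, and you can print or store the result like any other value. A quick sketch with arbitrary example values:

```python
a = 12
b = 4
values = [4, 8, 15]

print(a < b)        # False
print(a >= b)       # True
print(a == b)       # False
print(a != b)       # True
print(b in values)  # True
print(a % b == 0)   # True

# The result of a comparison can be stored and used in an if statement later.
is_multiple = a % b == 0
if is_multiple:
    print("a is divisible by b")
```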

### Exercise

The cell below sets "value" to a random value between 0 and 1. Write an if...elif...else statement that prints whether it is less than 0.25, between 0.25 and 0.75, or greater than 0.75.

In [None]:
value = random.random()

## Lists and tuples

Lists and tuples are both objects that can contain multiple items. The main difference is that a list is _mutable_ (meaning you can change its elements after it's created), while a tuple is _immutable_. These are conceptually similar to `pandas` data frame columns or `numpy` arrays, with some technical differences we'll discuss a bit later. Lists are created with comma-separated elements within square brackets, while tuples are created using comma-separated elements within parentheses.

In [None]:
mylist = ["this", "is", "a", "test"]

In [None]:
mytuple = ("this", "is", "a", "test")

To create a tuple with a single element, you need to include a comma after the element, so Python knows you want a tuple rather than parentheses that merely group an expression, as they do in arithmetic.

In [None]:
mytuple1 = ("test",)

In [None]:
mylist

In [None]:
mytuple

In [None]:
mytuple1
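To see why the trailing comma matters, we can compare the types Python assigns with and without it. The variable names here are illustrative.

```python
with_comma = ("test",)
without_comma = ("test")  # the parentheses only group the expression

print(type(with_comma))     # <class 'tuple'>
print(type(without_comma))  # <class 'str'>
```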

You can reference items in lists and tuples by number, using subscript notation. The numbers start with 0.

In [None]:
mylist[1]

In [None]:
mytuple[2]
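The difference between mutable and immutable shows up as soon as we try to assign to an element. The sketch below uses `try`/`except`, which we have not covered, only so that the error is printed rather than stopping the cell.

```python
mylist = ["this", "is", "a", "test"]
mytuple = ("this", "is", "a", "test")

# Lists are mutable: assigning to an index replaces that element.
mylist[3] = "list"
print(mylist)  # ['this', 'is', 'a', 'list']

# Tuples are immutable: the same assignment raises a TypeError.
try:
    mytuple[3] = "tuple"
except TypeError as error:
    print("Cannot change a tuple:", error)

print(mytuple)  # ('this', 'is', 'a', 'test')
```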

## Dicts

Dicts, or dictionaries, map a _key_ to a _value_. For instance, the code below defines a dictionary that maps foods to whether they are fruits or vegetables.

In [None]:
foods = {
    "cucumber": "vegetable",
    "strawberry": "fruit",
    "apple": "fruit",
    "squash": "vegetable",
    "potato": "vegetable",
    "mango": "fruit"
}
foods

As with lists, items in dicts can be looked up with subscript notation. The subscript is the key rather than a position.

In [None]:
foods["apple"]
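A few other common dict operations, sketched on a shortened version of the dictionary above: the `in` operator checks keys (not values), and assigning to a new key adds an entry.

```python
foods = {"cucumber": "vegetable", "apple": "fruit"}

print(foods["apple"])    # fruit
print("apple" in foods)  # True
print("fruit" in foods)  # False, because `in` looks at keys, not values

# Assigning to a key that does not exist yet adds a new entry.
foods["mango"] = "fruit"
print(len(foods))        # 3
```

Looking up a key that is not in the dict with subscript notation raises a `KeyError`, so checking with `in` first is a common pattern.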

## Loops

Sometimes, you may want to run the same section of code multiple times, for instance if you wanted to apply some operation to every item in a list. Loops allow you to do this. There are two types of loops in Python: `for` loops and `while` loops. `for` loops run a set number of times determined at the outset; this might be one _iteration_ for each value in a list, or simply a defined number of repetitions. `while` loops keep running until some condition is no longer true.

In the next cell, I define a list of numbers, and loop over them, printing out the squared value of each.

The syntax of a for loop is `for variable in list:`. The body of the loop (indented portion) will be executed once for each item in the list, and `variable` will refer to the current item. You can call variable pretty much anything you like.

In [None]:
numbers = [2, 6, 12, 3, 5]

for number in numbers:
    print(number ** 2)

It's fairly common to want to loop a specified number of times, and Python provides a handy function called `range` for that. `range(3)`, for example, produces the sequence 0, 1, 2, which has three items. (In Python 3 this is a `range` object rather than an actual list, but a `for` loop treats it the same way.) The `i` variable below references the current number within the loop.

In [None]:
for i in range(3):
    print(i ** 2)
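In Python 3, `range` returns a lazy `range` object rather than a list. If you want to see the numbers it will produce, you can convert it with `list`:

```python
numbers = range(3)

print(numbers)        # range(0, 3)
print(list(numbers))  # [0, 1, 2]
print(len(numbers))   # 3
```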

While loops are much less common than for loops. They are used when you don't know at the start how many times the loop should run. Instead, they have a condition at the top which is evaluated before each iteration, to determine if the loop should keep going. These are often used in model estimation processes, for example, when waiting for a model to reach some convergence/stopping criterion. For instance, the loop below is meant to find all Fibonacci numbers less than 100. We don't know how many there are at the outset, so we use a while loop.

In [None]:
previous = 0
current = 1

while current < 100:
    next_fib = previous + current
    print(next_fib)
    previous = current
    current = next_fib

The output also includes 144, the first Fibonacci number greater than 100. Why is this?


<details>
    <summary>Answer</summary>
    
    The while loop condition is evaluated at the start of each iteration. When `current` is 89, the condition is true, so the body runs and `next_fib` becomes 144, which is printed before the condition is checked again. We can fix this by printing `current` instead of `next_fib`.
</details>

In [None]:
previous = 0
current = 1

while current < 100:
    print(current)
    next_fib = previous + current
    previous = current
    current = next_fib

### Control flow in loops

There are two additional statements that can be used in loops: `break` and `continue`. `break` ends the loop immediately, while `continue` goes on to the next iteration immediately, skipping any remaining code in the loop body. They are almost always combined with an `if` statement, so they do not run on every loop iteration; otherwise there wouldn't be much point in using a loop. For example, we can rewrite the code above that loops over numbers and squares them so that it skips any numbers larger than 10. This also demonstrates _nesting_ loops and if statements. You can nest these together, and the body of each is determined by the level of indentation.

In [None]:
numbers = [2, 6, 12, 3, 5]

for number in numbers:
    if number > 10:
        continue
    
    print(number ** 2)

We could do the same thing, but instead exit the loop as soon as we reach a number greater than 10:

In [None]:
numbers = [2, 6, 12, 3, 5]

for number in numbers:
    if number > 10:
        break
    
    print(number ** 2)
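To compare the two statements side by side, the sketch below collects the squares into lists instead of printing them. It uses the list method `append`, which adds an item to the end of a list; the result variable names are illustrative.

```python
numbers = [2, 6, 12, 3, 5]

# continue: skip 12, then carry on with 3 and 5.
squares_with_continue = []
for number in numbers:
    if number > 10:
        continue
    squares_with_continue.append(number ** 2)

# break: stop entirely at 12, so 3 and 5 are never reached.
squares_with_break = []
for number in numbers:
    if number > 10:
        break
    squares_with_break.append(number ** 2)

print(squares_with_continue)  # [4, 36, 9, 25]
print(squares_with_break)     # [4, 36]
```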

The `break` statement is often used in conjunction with a `while` loop, in situations where the stopping criteria are complex and can't easily be expressed in the condition at the top of the loop. For instance, we might test the [Collatz conjecture](https://en.wikipedia.org/wiki/Collatz_conjecture). This is a deceptively simple open problem in mathematics, which states that repeatedly applying a simple operation (halving the number if it is even, or tripling it and adding 1 if it is odd) will eventually reach 1 for any positive integer starting value. No one has been able to prove this, but no one has found a counterexample, either. We will use a while loop combined with if statements to compute the Collatz sequence for a given starting value.

(If this class ends up inspiring you to solve it, mention me in your Fields Medal acceptance speech.)

In [None]:
collatz = 17 # change this to whatever you like

while True:
    print(collatz)
    if collatz % 2 == 0:
        # the number is even
        # // is integer divide in Python. The number is even so we know it can be divided by two and
        # still result in an integer.
        # Most mathematical operations have an "assigning" version with an = appended, that assigns the result
        # back to the variable on the left hand side
        collatz //= 2 
    else:
        # the number is odd
        collatz *= 3
        collatz += 1
    
    if collatz == 1:
        print("Success: Collatz number is 1")
        break

### Exercise

Rewrite the code above, which computes a Collatz sequence, without using a `break` statement.

## Functions

_Functions_ are re-usable pieces of code that you define once and can then run many times. Functions can optionally take one or more _arguments_, which affect how they work, and they _return_ a value. They work just like functions in math, which operate on some input and evaluate to some output. We've been using functions since we started using Python - for example, `read_csv`, `sum`, `print`, etc. are all functions. Now we'll learn to create our own.

Functions allow you to write code once and re-use it, rather than having the same or similar code all over the place in your notebooks. They even allow you to move your code out of your notebooks and into shared files. And they allow you to write automated tests to help ensure your analysis is correct.

In Python, the `def` (define) statement is used to create a function. It is followed by the name of the function, the arguments in parentheses, and a colon. An indented block follows which defines the _body_ of the function - the code that will be executed.

When you define a function, the code is not executed. Rather, it is stored, and is executed later, when you _call_ the function.

We can define a function that computes the Collatz sequence for any starting point - the starting point will be an argument to the function.

In [None]:
def collatz_conjecture(current_value):
    while True:
        print(current_value)
        if current_value % 2 == 0:
            # the number is even
            # // is integer divide in Python. The number is even so we know it can be divided by two and
            # still result in an integer.
            # Most mathematical operations have an "assigning" version with an = appended, that assigns the result
            # back to the variable on the left hand side
            current_value //= 2 
        else:
            # the number is odd
            current_value *= 3
            current_value += 1

        if current_value == 1:
            print("Success: Collatz number is 1")
            break

The function has now been defined, but none of the code in it has actually been run yet. The code is run when we _call_ the function, by writing its name followed by parentheses containing the arguments to the function, separated by commas if there is more than one.

In [None]:
collatz_conjecture(18)
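Functions can take more than one argument. As a minimal sketch, the hypothetical function `print_power` below (not part of the Collatz example) takes two arguments, which are matched to the parameter names in order when the function is called.

```python
def print_power(base, exponent):
    # Nothing here runs when the function is defined,
    # only when it is called.
    print(base ** exponent)

print_power(2, 5)  # prints 32
print_power(5, 2)  # prints 25
```

The same function body runs twice with different inputs, which is exactly the kind of repetition we would otherwise have to copy and paste.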

### Function return values

Our Collatz function just printed out the Collatz sequence with the `print` function. However, it is usually more useful to return a value, so that it can be used in future computations. We can write a new function that returns the next number in the Collatz sequence.

In [None]:
def next_collatz(value):
    if value % 2 == 0:
        # value is even
        return value // 2
    else:
        return 3 * value + 1

Now, running `next_collatz` will _return_ the next value in the sequence. At first, this doesn't really look any different, as Jupyter Notebook will automatically print out the return value of the last line of a cell if it is not saved to a variable.

In [None]:
next_collatz(17)

However, if we store the result in a variable, nothing is printed out.

In [None]:
after_17 = next_collatz(17)

We can then refer to that value later.

In [None]:
after_17
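The difference between printing and returning becomes clearer with two tiny functions that do the same arithmetic. Both `print_double` and `double` are hypothetical names used only for this sketch. A function with no `return` statement returns the special value `None`.

```python
def print_double(x):
    print(x * 2)   # shows the result, but does not hand it back

def double(x):
    return x * 2   # hands the result back to the caller

printed = print_double(4)  # prints 8
returned = double(4)       # prints nothing

print(printed)             # None
print(returned)            # 8

# A returned value can feed into further computation.
print(double(returned) + 1)  # 17
```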

### Exercise

Rewrite the while loop that calculates and prints Collatz sequences to use the `next_collatz` function.

## List and dict comprehensions

The final language features we'll talk about are _list and dict comprehensions_. A list comprehension combines a for loop with creating a list, in a single expression. It can be used to transform the elements of a list:

In [None]:
mylist = ["this", "is", "a", "test"]
capitalized = [word.upper() for word in mylist]
capitalized

It can also be used to filter a list, for instance keeping only words with more than one letter:

In [None]:
[word for word in mylist if len(word) > 1]
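A list comprehension is shorthand for a for loop that builds up a list one item at a time. The sketch below writes the same transform-and-filter operation both ways, using a list of numbers rather than words; the variable names are illustrative.

```python
numbers = [2, 6, 12, 3, 5]

# Long form: start with an empty list and append inside a loop.
small_squares_loop = []
for number in numbers:
    if number <= 10:
        small_squares_loop.append(number ** 2)

# Comprehension: expression first, then the loop, then the optional filter.
small_squares = [number ** 2 for number in numbers if number <= 10]

print(small_squares)                        # [4, 36, 9, 25]
print(small_squares == small_squares_loop)  # True
```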

### Exercise

Write a list comprehension that selects all words longer than three letters from `capitalized`, and lowercases them.

### Dict comprehensions

A dict comprehension is very similar to a list comprehension, except that it creates a dict rather than a list. It is enclosed in curly braces `{}` and uses a `:` to separate the key from the value. We can create a dictionary mapping a word to the length of that word.

In [None]:
{word: len(word) for word in mylist}

You can use variables or function calls on either side of the `:`, and you can use if statements to filter results, just like with list comprehensions. For instance, to map capitalized versions of words that start with T to their lengths, we might write:

In [None]:
{word.upper(): len(word) for word in mylist if word.lower().startswith("t")}
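As with list comprehensions, a dict comprehension is equivalent to a loop that fills in a dict one key at a time. Here is a sketch of both forms on a list of numbers, mapping each even number to its square; the variable names are illustrative.

```python
numbers = [2, 6, 12, 3, 5]

# Long form: start with an empty dict and assign keys inside a loop.
even_squares_loop = {}
for number in numbers:
    if number % 2 == 0:
        even_squares_loop[number] = number ** 2

# Comprehension: key, colon, value, then the loop and the optional filter.
even_squares = {number: number ** 2 for number in numbers if number % 2 == 0}

print(even_squares)                       # {2: 4, 6: 36, 12: 144}
print(even_squares == even_squares_loop)  # True
```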

### Exercise

Write a dict comprehension mapping lowercase to uppercase versions of words, for all words in `mylist` that are longer than one letter.