# A Crash Course in Python

## Whitespace Formatting

Many languages use curly braces to delimit blocks of code. Python uses **indentation**.

Python treats tabs and spaces as **different indentation** and will refuse to run code that mixes the two inconsistently.

When writing Python you should **always use spaces**, never tabs. (Jupyter Notebook converts tabs to spaces by default; other editors should be configured to do the same.)

In [None]:
# The pound sign marks the start of a comment. Python itself
# ignores the comments, but they're helpful for anyone reading the code.
for i in [1, 2, 3, 4, 5]:
    print("i = ", i)                    # first line in "for i" block
    for j in [1, 2, 3, 4, 5]:
        print("j = ", j)                # first line in "for j" block
        print("i + j = ", i + j)        # last line in "for j" block
    print("i = ", i)                    # last line in "for i" block
print("done looping")

Whitespace is ignored inside parentheses and brackets, which can be helpful for long-winded computations:

In [None]:
long_winded_computation = (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 +
                           13 + 14 + 15 + 16 + 17 + 18 + 19 + 20)

list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

easier_to_read_list_of_lists = [[1, 2, 3],
                                [4, 5, 6],
                                [7, 8, 9]]

print("same 2D list:")
print(list_of_lists)
print(easier_to_read_list_of_lists)

You can also use a backslash to indicate that a statement continues onto the next line, although we’ll rarely do this:

In [None]:
two_plus_three = 2 + \
                 3

print(two_plus_three)

## Modules

Modules not loaded by default can be imported.

In [None]:
import platform

x = platform.system()
print(x)

An alias may also be used, especially for long package names:

In [None]:
import platform as pl

x = pl.system()
print(x)

If you need a few specific values from a module, you can import them explicitly and use them without qualification:

In [None]:
from platform import system
x = system()
print(x)

## Functions

A function is a rule for taking zero or more inputs and returning a corresponding output.

In Python, we typically define functions using `def`:

In [None]:
def double(x):
    """
    This is where you put an optional docstring that explains what the
    function does. For example, this function multiplies its input by 2.
    """
    return x * 2

Python functions are *first-class*, which means that we can assign them to variables and pass them into functions just like any other arguments:

In [None]:
def apply_to_one(f):
    """Calls the function f with 1 as its argument"""
    return f(1)

In [None]:
my_double = double          # refers to the previously defined function
x = apply_to_one(my_double) # equals 2
print(x)
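Because functions are first-class, short throwaway functions can also be written inline as *lambdas*. The following is a minimal sketch: `apply_to_one` is repeated so the cell runs on its own, and `add_four` is a name chosen only for illustration.

```python
def apply_to_one(f):
    """Calls the function f with 1 as its argument"""
    return f(1)

# A lambda is an anonymous function written as a single expression
y = apply_to_one(lambda x: x + 4)   # equals 5

# A named def does the same job and is usually easier to read and debug
def add_four(x):
    return x + 4

print(y)
print(apply_to_one(add_four))       # also 5
```

Lambdas are handy for one-off arguments; for anything you plan to reuse, a `def` is usually the clearer choice.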

Function parameters can also be given default arguments, which only need to be specified when you want a value other than the default:

In [None]:
def my_print(message = "my default message"):
    print(message)

my_print("hello")   # prints 'hello'
my_print()          # prints 'my default message'

It is sometimes useful to specify arguments by name:

In [None]:
def full_name(first = "What's-his-name", last = "Something"):
    return first + " " + last

print(full_name("Joel", "Grus"))              # "Joel Grus"
print(full_name("Joel"))                      # "Joel Something"
print(full_name(last="Grus", first="First"))  # "First Grus"

## Strings

Strings can be delimited by single or double quotation marks (but the quotes have to match):

In [None]:
single_quoted_string = 'data science'
double_quoted_string = "data science"
#invalid_string = 'data science"

print(single_quoted_string)
print(double_quoted_string)

Python uses backslashes to encode special characters. For example:

In [None]:
tab_string = "\t"       # represents the tab character
print("before tab", tab_string, "after tab")
print(len(tab_string))  # is 1

If you want backslashes as backslashes (which you might in Windows directory names or in regular expressions), you can create raw strings using `r""`:

In [None]:
not_tab_string = r"\t"         # represents the characters '\' and 't'
string_2 = "\\t"
print(string_2)
print(not_tab_string)
print(len(not_tab_string))     # is 2
print(len(string_2))     # is 2

You can create multiline strings using three double quotes:

In [None]:
multi_line_string = """This is the first line.
and this is the second line.
and this is the third line."""

print(multi_line_string)

A new feature since Python 3.6 is the *f-string*, which provides a simple way to substitute values into strings. For example, if we had the first name and last name given separately:

In [None]:
first_name = "Joel"
last_name  = "Grus"

we might want to combine them into a full name. There are multiple ways to construct such a `full_name` string:

In [None]:
full_name1 = first_name + " " + last_name             # string addition
full_name2 = "{0} {1}".format(first_name, last_name)  # string.format
full_name3 = f"{first_name} {last_name}"              # f-string, preferred

print(full_name1)
print(full_name2)
print(full_name3)
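f-strings can also format the values they substitute. As a small sketch (the `score` value is made up for illustration), a format specification after a colon controls rounding and alignment:

```python
first_name = "Joel"
score = 91.456     # hypothetical value, only for illustration

# A format specification follows a colon inside the braces
rounded = f"{first_name} scored {score:.1f}"   # one digit after the decimal point
padded  = f"[{first_name:>8}]"                 # right-aligned in a field of width 8

print(rounded)
print(padded)
```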

## Exceptions

When something goes wrong, Python raises an exception. Unhandled, exceptions will cause your program to crash. You can handle them using `try` and `except`:

In [None]:
try:
    print(0 / 0)
except ZeroDivisionError:
    print("cannot divide by zero")
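A `try` statement can do a bit more than the example above shows. The sketch below uses a hypothetical helper, `safe_divide`, to illustrate capturing the exception object with `as`, plus the optional `else` and `finally` clauses:

```python
def safe_divide(numerator, denominator):
    """Returns numerator / denominator, or None if the division is impossible."""
    try:
        result = numerator / denominator
    except ZeroDivisionError as error:
        # `error` is the exception object, which carries a message
        print(f"could not divide: {error}")
        return None
    else:
        # runs only when the try block raised nothing
        return result
    finally:
        # runs in every case, even when a return is on its way out
        print("division attempted")

print(safe_divide(6, 3))
print(safe_divide(1, 0))
```

Catching a specific exception type, rather than every possible exception, keeps unrelated bugs from being silently swallowed.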

## Lists

Lists are probably the most fundamental data structure in Python. They are similar to *arrays* in other languages, but with some added functionality.

In [None]:
integer_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]
list_of_lists = [integer_list, heterogeneous_list, []]

list_length = len(integer_list)     # equals 3
list_sum    = sum(integer_list)     # equals 6

print(list_length)
print(list_sum)

You can get or set the nth element of a list with square brackets:

In [None]:
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

zero = x[0]          # equals 0, lists are 0-indexed
one = x[1]           # equals 1
nine = x[-1]         # equals 9, 'Pythonic' for last element
eight = x[-2]        # equals 8, 'Pythonic' for next-to-last element
x[0] = -1            # now x is [-1, 1, 2, 3, ..., 9]

print(f"zero  = {zero}")
print(f"one   = {one}")
print(f"nine  = {nine}")
print(f"eight = {eight}")
print(f"x = {x}")

You can also use square brackets to slice lists. The slice i:j means all elements from i (**inclusive**) to j (**exclusive**).

If you leave off the start of the slice, you’ll slice from the beginning of the list, and if you leave off the end of the slice, you’ll slice until the end of the list:

In [None]:
first_three = x[:3]                 # [-1, 1, 2]
three_to_end = x[3:]                # [3, 4, ..., 9]
one_to_four = x[1:5]                # [1, 2, 3, 4]

# -3 can be interpreted as len(x)-3
last_three = x[-3:]                 # [7, 8, 9]
without_first_and_last = x[1:-1]    # [1, 2, ..., 8]
copy_of_x = x[:]                    # [-1, 1, 2, ..., 9]

print(f"first_three = {first_three}")
print(f"three_to_end = {three_to_end}")
print(f"one_to_four = {one_to_four}")
print(f"last_three = {last_three}")
print(f"without_first_and_last = {without_first_and_last}")
print(f"copy_of_x = {copy_of_x}")

A slice can take a third argument to indicate its *stride*, which can be negative:

In [None]:
every_third = x[::3]                 # [-1, 3, 6, 9]
five_to_three = x[5:2:-1]            # [5, 4, 3]

print(f"every_third = {every_third}")
print(f"five_to_three = {five_to_three}")

Python has an `in` operator to check for list membership. This check is a linear search with a running time of *O(n)*, where *n* is the length of the list:

In [None]:
print(1 in [1, 2, 3])    # True
print(0 in [1, 2, 3])    # False

If you want to modify a list in place, you can use `extend` to add items from another collection:

In [None]:
x = [1, 2, 3]
x.extend([4, 5, 6])     # x is now [1, 2, 3, 4, 5, 6]

print(f"x = {x}")

If you don’t want to modify `x`, you can use list addition:

In [None]:
x = [1, 2, 3]
y = x + [4, 5, 6]       # y is [1, 2, 3, 4, 5, 6]; x is unchanged

print(f"x = {x}")
print(f"y = {y}")

More frequently we will append to lists one item at a time:

In [None]:
x = [1, 2, 3]
x.append(0)      # x is now [1, 2, 3, 0]
y = x[-1]        # equals 0
z = len(x)       # equals 4

print(f"x = {x}")
print(f"y = {y}")
print(f"z = {z}")

It’s often convenient to *unpack* lists when you know how many elements they contain:

In [None]:
# Must have the same number of values on both sides
x, y = [1, 2]    # now x is 1, y is 2

print(f"x = {x}")
print(f"y = {y}")

If you do not care about a value, an underscore can be used to represent a value that you’re going to throw away:

In [None]:
_, y = [1, 2]    # now y == 2, didn't care about the first element

print(f"y = {y}")
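When the number of elements is not known in advance, a starred name can collect "everything else". This is a short sketch of that variant; the variable names are arbitrary:

```python
first, *rest = [1, 2, 3, 4]      # first is 1, rest is [2, 3, 4]
head, *_, tail = [1, 2, 3, 4]    # head is 1, tail is 4, the middle is thrown away

print(f"first = {first}, rest = {rest}")
print(f"head = {head}, tail = {tail}")
```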

## Tuples

Tuples are lists’ **immutable** cousins. Pretty much anything you can do to a list that doesn’t involve modifying it, you can do to a tuple.

You specify a tuple by using parentheses (or nothing) instead of square brackets:

In [None]:
my_list = [1, 2]
my_tuple = (1, 2)
other_tuple = 3, 4
my_list[1] = 3      # my_list is now [1, 3]

print(f"my_list = {my_list}")
print(f"my_tuple = {my_tuple}")
print(f"other_tuple = {other_tuple}")

try:
    my_tuple[1] = 3
except TypeError:
    print("cannot modify a tuple")

Tuples are a convenient way to return multiple values from functions:

In [None]:
def sum_and_product(x, y):
    return (x + y), (x * y)

sp = sum_and_product(2, 3)     # sp is (5, 6)
s, p = sum_and_product(2, 3)   # s is 5, p is 6

print(f"sp = {sp}")
print(f"s  = {s}")
print(f"p  = {p}")

Tuples (and lists) can also be used for *multiple assignment*:

In [None]:
x, y = 1, 2     # now x is 1, y is 2

print("before swap:")
print(f"x = {x}, y = {y}")
x, y = y, x     # Pythonic way to swap variables; now x is 2, y is 1
print("after swap:")
print(f"x = {x}, y = {y}")

## Dictionaries

Another fundamental data structure is a dictionary, which associates *values* with *keys* and allows you to quickly retrieve the value corresponding to a given key.

They are similar to *maps* in other languages. 

In [2]:
empty_dict = {}                     # Pythonic
empty_dict2 = dict()                # less Pythonic
grades = {"Joel": 80, "Tim": 95}    # dictionary literal

print(f"grades = {grades}")

grades = {'Joel': 80, 'Tim': 95}


You can look up the value for a key using square brackets:

In [3]:
joels_grade = grades["Joel"]        # equals 80

print(f"joels_grade = {joels_grade}")

joels_grade = 80


But you’ll get a `KeyError` if you ask for a key that’s not in the dictionary:

In [None]:
try:
    kates_grade = grades["Kate"]
except KeyError:
    print("no grade for Kate!")

You can check for the existence of a key using `in`:

In [None]:
print("Joel" in grades)     # True
print("Kate" in grades)     # False

print(100 in grades.values()) # check if 100 is one of the values

Dictionaries have a `get` method that returns a default value (instead of raising an exception) when you look up a key that’s not in the dictionary:

In [None]:
joels_grade = grades.get("Joel", 0)   # equals 80, equivalent to grades["Joel"]
kates_grade = grades.get("Kate", 0)   # equals 0
no_ones_grade = grades.get("No One")  # default is None

print(f"joels_grade = {joels_grade}")
print(f"kates_grade = {kates_grade}")
print(f"no_ones_grade = {no_ones_grade}")

You can assign (or add) key/value pairs using the same square brackets:

In [None]:
grades["Tim"] = 99                    # replaces the old value
grades["Kate"] = 100                  # adds a third entry
num_students = len(grades)            # equals 3

print(grades["Tim"])
print(grades["Kate"])
print(num_students)
print(grades)

Dictionaries can be used to represent structured data:

In [None]:
tweet = {
    "user" : "joelgrus",
    "text" : "Data Science is Awesome",
    "retweet_count" : 100,
    "hashtags" : ["#data", "#science", "#datascience", "#awesome", "#yolo"]
}

print(tweet)

Besides looking for specific keys, we can look at all of them:

In [None]:
tweet_keys   = tweet.keys()     # iterable for the keys
tweet_values = tweet.values()   # iterable for the values
tweet_items  = tweet.items()    # iterable for the (key, value) tuples

print(f"tweet_keys = {tweet_keys}")
print(f"tweet_values = {tweet_values}")
print(f"tweet_items = {tweet_items}")

print("user" in tweet_keys)         # True, but not Pythonic
print("user" in tweet)              # Pythonic way of checking for keys, discussed earlier
print("joelgrus" in tweet_values)   # True (slow but the only way to check)
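In practice these iterables are most often consumed in a `for` loop. The sketch below uses a trimmed-down copy of the tweet so that it runs on its own:

```python
tweet = {"user": "joelgrus", "retweet_count": 100}   # trimmed-down copy of the tweet

# items() yields (key, value) tuples, which unpack neatly in a for loop
for key, value in tweet.items():
    print(f"{key}: {value}")

# iterating over the dict itself yields its keys, in insertion order
keys_seen = [key for key in tweet]
print(keys_seen)
```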

### defaultdict

A `defaultdict` is like a regular dictionary, except that when you try to look up a key it doesn’t contain, it first adds a value for it using a zero-argument function you provided when you created it.

In order to use `defaultdict`, you have to import it from `collections`:

In [4]:
from collections import defaultdict

document = ["data", "science", "from", "scratch", "more", "data"]

word_counts = defaultdict(int)         # int() is a function that produces 0
for word in document:
    word_counts[word] += 1             # if word is not yet a key, (word, 0) is added to the dict first
    
print(word_counts)

defaultdict(<class 'int'>, {'data': 2, 'science': 1, 'from': 1, 'scratch': 1, 'more': 1})


They can also be useful with `list` or `dict`:

In [None]:
dd_list = defaultdict(list)             # list() produces an empty list
dd_list[2].append(1)                    # first adds (2, []) to the dict, then appends 1: {2: [1]}
#dd_list[2] = [1]

dd_dict = defaultdict(dict)             # dict() produces an empty dict
dd_dict["Joel"]["City"] = "Seattle"     # {"Joel": {"City": "Seattle"}}

print(dd_list)
print(dd_dict)
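The zero-argument function does not have to be a built-in type. As a sketch, a `lambda` can supply any default you like; the name `dd_pair` is only illustrative:

```python
from collections import defaultdict

dd_pair = defaultdict(lambda: [0, 0])   # each missing key starts out as [0, 0]
dd_pair[2][1] = 1                       # first adds (2, [0, 0]), then sets index 1

print(dict(dd_pair))                    # {2: [0, 1]}
```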

## Counters

A `Counter` turns a sequence of values into a defaultdict(int)-like object mapping keys to counts:

In [1]:
from collections import Counter
c = Counter(["a", "b", "c", "a"])          # c is (basically) {'a': 2, 'b': 1, 'c': 1}

print(c)

Counter({'a': 2, 'b': 1, 'c': 1})


This gives us a very simple way to solve our word_counts problem:

In [5]:
# recall, document is a list of words
word_counts = Counter(document)
print(word_counts)

Counter({'data': 2, 'science': 1, 'from': 1, 'scratch': 1, 'more': 1})


A `Counter` instance has a `most_common` method that is frequently useful:

In [None]:
# print the 10 most common words and their counts
for word, count in word_counts.most_common(10):
    print(word, count)

## Sets

Another useful data structure is the `set`, which represents a collection of *distinct* elements.

You can define a set by listing its elements between curly braces:

In [None]:
primes_below_10 = {2, 3, 5, 7}

However, that doesn’t work for empty sets, as {} already means “empty dict.”

In that case you’ll need to use `set()` itself:

In [None]:
s = set()
s.add(1)       # s is now {1}
s.add(2)       # s is now {1, 2}
s.add(2)       # s is still {1, 2}
x = len(s)     # equals 2
y = 2 in s     # equals True
z = 3 in s     # equals False

print(f"s = {s}")
print(f"x = {x}")
print(f"y = {y}")
print(f"z = {z}")

The `in` operation is very fast on sets, making it more appropriate for membership tests than a list:

In [None]:
hundreds_of_other_words = ["fill", "in", "tons", "of", "words", "in", "here"]  # required for the below code to run

stopwords_list = ["a", "an", "at"] + hundreds_of_other_words + ["yet", "you"]

print("zip" in stopwords_list)     # False, but have to check every element

stopwords_set = set(stopwords_list)
print("zip" in stopwords_set)      # very fast to check
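To see the difference for yourself, you can time both lookups with the standard `timeit` module. This is only a rough sketch: `words_list` is a made-up stand-in for a large word list, and the exact timings will vary from machine to machine.

```python
import timeit

words_list = [str(n) for n in range(100_000)]   # hypothetical stand-in for a large word list
words_set = set(words_list)

# Both containers give the same answer...
in_list = "zip" in words_list
in_set = "zip" in words_set

# ...but the list is scanned element by element, while the set uses hashing
list_seconds = timeit.timeit(lambda: "zip" in words_list, number=20)
set_seconds = timeit.timeit(lambda: "zip" in words_set, number=20)

print(f"list: {list_seconds:.6f}s, set: {set_seconds:.6f}s")
```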

It's also very convenient to find the *distinct* items in a collection using `set`.

In [None]:
item_list = [1, 2, 3, 1, 2, 3]
num_items = len(item_list)                # 6
item_set = set(item_list)                 # {1, 2, 3}
num_distinct_items = len(item_set)        # 3
distinct_item_list = list(item_set)       # [1, 2, 3]

print(f"num_items = {num_items}")
print(f"num_distinct_items = {num_distinct_items}")
print(f"distinct_item_list = {distinct_item_list}")

## Control Flow

As in most programming languages, you can perform an action conditionally using `if`:

In [None]:
if 1 > 2:
    message = "if only 1 were greater than two..."
elif 1 > 3:
    message = "elif stands for 'else if'"
else:
    message = "when all else fails use else (if you want to)"
    
print(message)

You can also write a *ternary* if-then-else on one line, which we will do occasionally:

In [None]:
x = 1
parity = "even" if x % 2 == 0 else "odd"
print(parity)

Python has a `while` loop:

In [None]:
x = 0
while x < 10:
    print(f"{x} is less than 10")
    x += 1

although more often we’ll use `for` and `in`:

In [None]:
# range(10) is the numbers 0, 1, ..., 9
for x in range(10):
    print(f"{x} is less than 10")

If you need more complex logic, you can use `continue` and `break`:

In [None]:
for x in range(10):
    if x == 3:
        continue  # go immediately to the next iteration
    if x == 5:
        break     # quit the loop entirely
    print(x)

## Truthiness

Booleans in Python work as in most other languages, except that they’re capitalized:

In [None]:
one_is_less_than_two = 1 < 2          # equals True
true_equals_false = True == False     # equals False

print(one_is_less_than_two)
print(true_equals_false)

Python uses the value `None` to indicate a nonexistent value. It is similar to other languages’ `null`:

In [None]:
x = None

# assert raises an exception if the condition is False (discussed in detail later)
assert x == None, "this is not the Pythonic way to check for None"
assert x is None, "this is the Pythonic way to check for None"

There is a subtle but important difference between `==` and `is`:
* `==` checks whether the contents/values of the two objects are the same
* `is` checks whether the two objects have the same identity (that is, whether they are the same object in memory)

In [None]:
list1 = [1, 2]
list1a = list1
print(list1a == list1) # True, similar to Java's equals()
print(list1a is list1) # True, similar to Java's ==

list2 = [1, 2]
print(list2 == list1)  # True, same contents
print(list2 is list1)  # False, two distinct objects

Python lets you use any value where it expects a Boolean. The following are all “falsy”:
* `False`
* `None`
* `[]` (an empty `list`)
* `{}` (an empty `dict`)
* `""`
* `set()`
* `0`
* `0.0`

Pretty much anything else gets treated as `True`. This allows you to easily use `if` statements to test for empty lists, empty strings, empty dictionaries, and so on.

In [None]:
def some_function_that_returns_a_string():
    return ""

s = some_function_that_returns_a_string()
if s:
    first_char = s[0]
else:
    first_char = ""
    
print(len(first_char))

Python has an `all` function, which takes an iterable and returns `True` precisely when every element is truthy (or when the iterable contains no elements), and an `any` function, which returns `True` when at least one element is truthy.

Note: an iterable is any Python object capable of returning its members one at a time, permitting it to be iterated over in a for-loop, e.g., lists, tuples, strings, etc.

In [None]:
print(all([True, 1, {3}]))   # True, all are truthy
print(all([True, 1, {}]))    # False, {} is falsy
print(any([True, 1, {}]))    # True, True is truthy
print(all([]))               # True, no falsy elements in the list
print(any([]))               # False, no truthy elements in the list
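`all` and `any` are most useful when combined with a condition applied to each element. A small sketch, using made-up temperature readings:

```python
temperatures = [12.5, 18.0, 21.3]   # hypothetical readings, only for illustration

# A generator expression checks each element without building a temporary list
all_positive = all(t > 0 for t in temperatures)   # True, every reading is above 0
any_hot = any(t > 30 for t in temperatures)       # False, no reading is above 30

print(all_positive)
print(any_hot)
```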

## Sorting

Every Python list has a `sort` method that sorts it in place. If you don’t want to mess up your list, you can use the `sorted` function, which returns a new list:

In [None]:
x = [4, 1, 2, 3]
y = sorted(x)     # y is [1, 2, 3, 4], x is unchanged

print(f"x = {x}")
print(f"y = {y}")

x.sort()          # now x is [1, 2, 3, 4]

print(f"x = {x}")

By default, `sort` (and `sorted`) sort a list from smallest to largest based on naively comparing the elements to one another.

If you want elements sorted from largest to smallest, you can specify a `reverse=True` parameter. And instead of comparing the elements themselves, you can compare the results of a function that you specify with `key`:

In [None]:
# sort the list by absolute value from largest to smallest
x = sorted([-4, 1, -2, 3], key=abs, reverse=True)  # is [-4, 3, -2, 1]

print(x)
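The `key` function is often a small `lambda`. As a sketch, here is one way to sort words by their counts, using a made-up `word_counts` dictionary:

```python
word_counts = {"data": 2, "science": 1, "scratch": 3}   # made-up counts

# sort the (word, count) pairs from highest count to lowest
wc = sorted(word_counts.items(),
            key=lambda word_and_count: word_and_count[1],
            reverse=True)

print(wc)   # [('scratch', 3), ('data', 2), ('science', 1)]
```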

## List Comprehensions

Frequently, you’ll want to transform a list into another list by choosing only certain elements, by transforming elements, or both. The Pythonic way to do this is with *list comprehensions*:

In [None]:
even_numbers = [x for x in range(5) if x % 2 == 0]  # [0, 2, 4]
squares      = [x * x for x in range(5)]            # [0, 1, 4, 9, 16]
even_squares = [x * x for x in even_numbers]        # [0, 4, 16]

print(f"even_numbers = {even_numbers}")
print(f"squares      = {squares}")
print(f"even_squares = {even_squares}")

You can similarly turn lists into dictionaries or sets:

In [None]:
square_dict = {x: x * x for x in range(5)}  # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
square_set  = {x * x for x in [1, -1]}      # {1}

print(f"square_dict = {square_dict}")
print(f"square_set  = {square_set}")

If you don’t need the value from the list, it’s common to use an underscore as the variable:

In [None]:
zeros = [0 for _ in even_numbers]      # has the same length as even_numbers

print(zeros)

A list comprehension can include multiple `for`s:

In [None]:
pairs = [(x, y)
         for x in range(10)
         for y in range(10)]   # 100 pairs (0,0) (0,1) ... (9,8), (9,9)

print(pairs)

and later `for`s can use the results of earlier ones:

In [None]:
increasing_pairs = [(x, y)                       # only pairs with x < y,
                    for x in range(10)           # range(lo, hi) equals
                    for y in range(x + 1, 10)]   # [lo, lo + 1, ..., hi - 1]

print(increasing_pairs)

## Automated Testing and assert

As data scientists, we’ll be writing a lot of code. How can we be confident our code is correct? One way is to use *automated tests*.

We will be using `assert` statements, which will cause your code to raise an `AssertionError` if your specified condition is not truthy:

In [None]:
# try making the condition False
assert 1 + 1 == 2
assert 1 + 1 == 2, "1 + 1 should equal 2 but didn't"

As you can see in the second case, you can optionally add a message to be printed if the assertion fails.

It’s not particularly interesting to assert that 1 + 1 = 2. What’s more interesting is to assert that functions you write are doing what you expect them to:

In [None]:
def smallest_item(xs):
    return min(xs)  # what if you change it to max(xs)?

assert smallest_item([10, 20, 5, 40]) == 5
assert smallest_item([1, 0, -1, 2]) == -1

Another less common use is to assert things about inputs to functions:

In [None]:
def smallest_item(xs):
    assert xs, "empty list has no smallest item"
    return min(xs)

smallest_item([10, 20, 5, 40])
# smallest_item([])

## Object-Oriented Programming

Like many languages, Python allows you to define classes that encapsulate data and the functions that operate on them. We’ll use them sometimes to make our code cleaner and simpler.

Here we’ll construct a class representing a “counting clicker,” the sort that is used at the door to track how many people have shown up for a meeting.

It maintains a `count`, can be `click`ed to increment the count, allows you to `read` the count, and can be `reset` back to zero. (In real life one of these rolls over from 9999 to 0000, but we won’t bother with that.)

To define a class, you use the `class` keyword and a PascalCase name.

A class contains zero or more member functions. By convention, each takes a first parameter, `self`, that refers to the particular class instance.

Normally, a class has a constructor, named `__init__`. It takes whatever parameters you need to construct an instance of your class and does whatever setup you need.

Notice that the `__init__` method name starts and ends with double underscores. These “magic” methods are sometimes called “dunder” methods (double-UNDERscore, get it?) and represent “special” behaviors.

Note: Methods whose names start with an underscore are, by convention, considered “private,” and users of the class are not supposed to call them directly. However, Python will not stop users from calling them.

Another such method is `__repr__`, which produces the string representation of a class instance.

Finally, we need to implement the public API of our class. Putting all of these pieces together:

In [None]:
class CountingClicker:
    """A class can/should have a docstring, just like a function"""
    def __init__(self, count = 0):
        self.count = count
        
    def __repr__(self):
        return f"CountingClicker(count={self.count})"
    
    def click(self, num_times = 1):
        """Click the clicker some number of times."""
        self.count += num_times

    def read(self):
        return self.count

    def reset(self):
        self.count = 0

Having defined it, let’s use `assert` to write some test cases for our clicker:

In [None]:
clicker = CountingClicker()
assert clicker.read() == 0, "clicker should start with count 0"
clicker.click()
clicker.click()
assert clicker.read() == 2, "after two clicks, clicker should have count 2"
clicker.reset()
assert clicker.read() == 0, "after reset, clicker should be back to 0"

print(clicker)

We’ll also occasionally create subclasses that inherit some of their functionality from a parent class. For example, we could create a non-resettable clicker by using `CountingClicker` as the base class and overriding the `reset` method to do nothing:

In [None]:
# A subclass inherits all the behavior of its parent class.
class NoResetClicker(CountingClicker):
    # This class has all the same methods as CountingClicker

    # Except that it has a reset method that does nothing.
    def reset(self):
        pass

In [None]:
clicker2 = NoResetClicker()
assert clicker2.read() == 0
clicker2.click()
assert clicker2.read() == 1
clicker2.reset()
assert clicker2.read() == 1, "reset shouldn't do anything"
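A subclass can also extend a parent method instead of replacing it, by calling `super()`. The sketch below is an assumption-laden illustration rather than part of the clicker design above: `RolloverClicker` and its `limit` parameter are hypothetical, and a trimmed copy of `CountingClicker` is repeated so the cell runs on its own.

```python
class CountingClicker:
    """Trimmed copy of the class above, repeated so this cell is self-contained"""
    def __init__(self, count = 0):
        self.count = count

    def click(self, num_times = 1):
        self.count += num_times

    def read(self):
        return self.count


class RolloverClicker(CountingClicker):
    """Hypothetical clicker that wraps around to 0 once it reaches `limit`"""
    def __init__(self, count = 0, limit = 10000):
        super().__init__(count)     # let the parent class set up self.count
        self.limit = limit

    def click(self, num_times = 1):
        super().click(num_times)    # reuse the parent's increment logic
        self.count %= self.limit    # then wrap around, e.g. 9999 + 1 -> 0


clicker3 = RolloverClicker(count=9999)
clicker3.click()
print(clicker3.read())
```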

## Iterables and Generators

One nice thing about a list is that you can retrieve specific elements by their indices. But you don’t always need this! A list of a billion numbers takes up a lot of memory. If you only want the elements one at a time, there’s no good reason to keep them all around. If you only end up needing the first several elements, generating the entire billion is hugely wasteful.

Often all we need is to iterate over the collection using `for` and `in`. In this case we can create generators, which can be iterated over just like lists but generate their values lazily on demand.

One way to create generators is with functions and the `yield` keyword:

In [None]:
def generate_range(n):
    i = 0
    while i < n:
        yield i   # each time execution reaches yield, the generator produces one value
        i += 1

The following loop will consume the `yield`ed values one at a time:

In [None]:
for i in generate_range(10 ** 100):   # far more values than could ever be stored in a list
    if i > 10:
        break
    print(f"i: {i}")

(In fact, `range` is itself lazy, so there’s no point in doing this.)

With a generator, you can even create an infinite sequence:

In [None]:
def natural_numbers():
    """returns 1, 2, 3, ..."""
    n = 1
    while True:
        yield n
        n += 1

although you probably shouldn’t iterate over it without using some kind of `break` logic.
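One convenient form of such "break logic" is `itertools.islice` from the standard library, which takes only the first few values of any iterable. A minimal sketch, with `natural_numbers` repeated so the cell runs on its own:

```python
from itertools import islice

def natural_numbers():
    """returns 1, 2, 3, ..."""
    n = 1
    while True:
        yield n
        n += 1

# islice stops after 5 values, so the infinite generator is never exhausted
first_five = list(islice(natural_numbers(), 5))

print(first_five)   # [1, 2, 3, 4, 5]
```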

A second way to create generators is by using `for` comprehensions wrapped in parentheses.

Note: **you can only iterate through a generator once**. If you need to iterate through something multiple times, you’ll need to either re-create the generator each time or use a list.

In [None]:
evens_below_20      = (i for i in range(20) if i % 2 == 0)
evens_below_20_list = [i for i in range(20) if i % 2 == 0]

print(evens_below_20)
print(evens_below_20_list)

for i in evens_below_20:    # try to use a list instead
    print(f"i = {i}")
    
print("try again")
for i in evens_below_20:    # try to use a list instead
    print(f"i = {i}")

Such a “generator comprehension” doesn’t do any work until you iterate over it (using `for` or `next`). We can use this to build up elaborate data-processing pipelines:

In [None]:
# None of these computations *does* anything until we iterate
data = natural_numbers()
evens = (x for x in data if x % 2 == 0)
even_squares = (x ** 2 for x in evens)
even_squares_ending_in_six = (x for x in even_squares if x % 10 == 6)
# and so on

assert next(even_squares_ending_in_six) == 16
assert next(even_squares_ending_in_six) == 36
assert next(even_squares_ending_in_six) == 196

Not infrequently, when we’re iterating over a list or a generator we’ll want not just the values but also their indices. For this common case Python provides an `enumerate` function, which turns values into pairs `(index, value)`:

In [None]:
names = ["Alice", "Bob", "Charlie", "Debbie"]

# not Pythonic
for i in range(len(names)):
    print(f"name {i} is {names[i]}")

# also not Pythonic
i = 0
for name in names:
    print(f"name {i} is {names[i]}")
    i += 1

# Pythonic
for i, name in enumerate(names):
    print(f"name {i} is {name}")
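`enumerate` also accepts an optional `start` argument, which is handy when a listing should be numbered from 1 for human readers. A short sketch (the `numbered` list is only for illustration):

```python
names = ["Alice", "Bob", "Charlie", "Debbie"]

# start=1 changes only the counter, not which elements are visited
numbered = [f"{position}. {name}" for position, name in enumerate(names, start=1)]

print(numbered)
```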

## Randomness

As we learn data science, we will frequently need to generate random numbers, which we can do with the `random` module:

In [None]:
import random
random.seed(10)  # this ensures we get the same results every time

four_uniform_randoms = [random.random() for _ in range(4)]

# [0.5714025946899135,       # random.random() produces numbers
#  0.4288890546751146,       # uniformly between 0 and 1
#  0.5780913011344704,       # it's the random function we'll use
#  0.20609823213950174]      # most often

print(four_uniform_randoms)

The `random` module actually produces pseudorandom (that is, deterministic) numbers based on an internal state that you can set with `random.seed` if you want to get reproducible results:

In [None]:
random.seed(10)         # set the seed to 10
print(random.random())  # 0.57140259469
random.seed(10)         # reset the seed to 10
print(random.random())  # 0.57140259469 again

We’ll sometimes use `random.randrange`, which takes either one or two arguments and returns an element chosen randomly from the corresponding range:

In [None]:
print(random.randrange(10))    # choose randomly from range(10) = [0, 1, ..., 9]
print(random.randrange(3, 6))  # choose randomly from range(3, 6) = [3, 4, 5]

There are a few more methods that we’ll sometimes find convenient. For example, `random.shuffle` randomly reorders the elements of a list:

In [None]:
up_to_ten = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
random.shuffle(up_to_ten)
print(up_to_ten)

If you need to randomly pick one element from a list, you can use `random.choice`:

In [None]:
my_best_friend = random.choice(["Alice", "Bob", "Charlie"])

print(my_best_friend)

And if you need to randomly choose a sample of elements without replacement (i.e., with no duplicates), you can use  `random.sample`:

In [None]:
lottery_numbers = range(60)
winning_numbers = random.sample(lottery_numbers, 6)

print(winning_numbers)

To choose a sample of elements with replacement (i.e., allowing duplicates), you can just make multiple calls to `random.choice`:

In [None]:
four_with_replacement = [random.choice(range(10)) for _ in range(4)]
print(four_with_replacement)
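As an alternative sketch, the standard library's `random.choices` (available since Python 3.6) draws several elements with replacement in a single call. The seed value here is arbitrary:

```python
import random

random.seed(10)   # arbitrary seed, only to make the run reproducible

# k is the number of draws; the same element may be picked more than once
four_with_replacement = random.choices(range(10), k=4)

print(four_with_replacement)
```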

## Regular Expressions

Regular expressions provide a way of searching text. They are incredibly useful, but also fairly complicated—so much so that there are entire books written about them.

Here are a few examples of how to use them in Python:

In [None]:
import re

re_examples = [                        # all of these are true, because
    not re.match("a", "cat"),              #  'cat' doesn't start with 'a'
    re.search("a", "cat"),                 #  'cat' has an 'a' in it
    not re.search("c", "dog"),             #  'dog' doesn't have a 'c' in it
    3 == len(re.split("[ab]", "carbs")),   #  split on a or b to ['c','r','s']
    "R-D-" == re.sub("[0-9]", "-", "R2D2") #  replace digits with dashes
    ]

assert all(re_examples), "all the regex examples should be True"

One important thing to note is that `re.match` checks whether the beginning of a string matches a regular expression, while `re.search` checks whether any part of a string matches a regular expression.
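Both functions return a match object on success and `None` on failure, which is why the examples above can be used as truth values. The sketch below, using made-up strings, shows the difference between the two and how to read information out of a match object:

```python
import re

match_at_start = re.match(r"\d+", "2024 was a year")   # succeeds: digits at the start
no_match       = re.match(r"\d+", "the year 2024")     # None: the string starts with 't'
found_anywhere = re.search(r"\d+", "the year 2024")    # succeeds: digits found later on

print(match_at_start.group())                           # the matched text
print(no_match)
print(found_anywhere.group(), found_anywhere.start())   # matched text and its position
```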

## zip and Argument Unpacking

Often we will need to `zip` two or more iterables together. The `zip` function transforms multiple iterables into a single iterable of tuples:

In [None]:
list1 = ['a', 'b', 'c']
list2 = [1, 2, 3]

# zip is lazy, so you have to do something like the following
[pair for pair in zip(list1, list2)]    # is [('a', 1), ('b', 2), ('c', 3)]

If the lists are different lengths, `zip` stops as soon as the shortest list ends.
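The truncation is easy to check, and if you would rather keep the leftover elements, the standard library's `itertools.zip_longest` pads the shorter inputs with `None`. A small sketch with made-up lists:

```python
from itertools import zip_longest

three_letters = ['a', 'b', 'c']
two_numbers = [1, 2]

# zip silently drops 'c' because two_numbers runs out first
short_pairs = list(zip(three_letters, two_numbers))

# zip_longest keeps going and fills the gaps with None
padded_pairs = list(zip_longest(three_letters, two_numbers))

print(short_pairs)    # [('a', 1), ('b', 2)]
print(padded_pairs)   # [('a', 1), ('b', 2), ('c', None)]
```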

You can also “unzip” a list using a strange trick:

In [None]:
pairs = [('a', 1), ('b', 2), ('c', 3)]
letters, numbers = zip(*pairs)

print(letters)
print(numbers)

The asterisk (\*) performs *argument unpacking*, which uses the elements of pairs as individual arguments to `zip`. It ends up the same as if you’d called:

In [None]:
letters, numbers = zip(('a', 1), ('b', 2), ('c', 3))

print(letters)
print(numbers)

It may help to picture *argument unpacking* as removing the container the elements are stored in and then immediately passing the individual elements as arguments to another function.

You can use *argument unpacking* with any function:

In [None]:
def add(a, b): return a + b

print(add(1, 2))      # returns 3
try:
    add([1, 2])
except TypeError:
    print("add expects two inputs")
print(add(*[1, 2]))   # returns 3
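The same idea works for named arguments: a double asterisk (`**`) unpacks a dictionary into keyword arguments, and both forms can appear in a function definition to accept arbitrary arguments. The following is a sketch; `describe_call` is a hypothetical helper that simply reports what it received.

```python
def add(a, b):
    return a + b

named = {"a": 1, "b": 2}
total = add(**named)          # same as add(a=1, b=2)

def describe_call(*args, **kwargs):
    """Hypothetical helper: returns the arguments it was called with."""
    # args is a tuple of the unnamed arguments, kwargs a dict of the named ones
    return args, kwargs

received = describe_call(1, 2, key="word")

print(total)      # 3
print(received)   # ((1, 2), {'key': 'word'})
```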