# Data Science From Scratch Notes
## Chapter 2: A Crash Course in Python

## Functions

Define functions using `def`:

In [1]:
def double(x):
    """
    Place an optional docstring that explains what the function does.
    For example, this function multiplies its input by 2
    """
    return x * 2


In [2]:
double(2)

4

Python functions are *first class*: they can be assigned to variables and passed into other functions.

In [3]:
def apply_to_one(f):
    """Calls the function f with 1 as its argument"""
    return f(1)

my_double = double # refers to the previously defined function
x = apply_to_one(my_double)

In [4]:
x

2

*Lambda Functions*: Short anonymous functions.

The code below passes a lambda function to `apply_to_one`, which calls it with 1 as the argument `x`:

In [5]:
y = apply_to_one(lambda x: x + 4) # equals 5
y

5

Function parameters can be given default arguments.

In [6]:
def my_print(message = "my default message"):
    print(message)
    
my_print("hello") # prints 'hello'
my_print() # prints 'my default message'

hello
my default message


You can specify arguments by name as well.

In [7]:
def full_name(first = "What's-his-name", last = "Something"):
    return first + " " + last

full_name("Joel", "Grus") # "Joel Grus"
full_name("Joel") # "Joel Something"
full_name(last = "Grus") # "What's-his-name Grus"

"What's-his-name Grus"

## Strings

* Strings can be delimited by single or double quotation marks

* Backslashes are used to encode special characters

In [8]:
tab_string = "\t"
tab_string


'\t'

* To write literal backslashes, create a raw string by prefixing it with `r`

In [9]:
not_tab_string = r"\t"
not_tab_string

'\\t'

* Create multiline strings using three double quotes

In [10]:
multi_line_string = """This is the first line.
and this is the second line
and this is the third line"""

multi_line_string

'This is the first line.\nand this is the second line\nand this is the third line'

* *f-string*: substitute values into strings

In [11]:
first_name = "Joel"
last_name = "Grus"

full_name1 = first_name + " " + last_name # string addition
full_name2 = "{0} {1}".format(first_name, last_name) # string.format
full_name3 = f"{first_name} {last_name}" # f-string method

print(full_name1)
print(full_name2)
print(full_name3)

Joel Grus
Joel Grus
Joel Grus


## Exceptions

*Exceptions* are raised when something goes wrong. An unhandled exception will cause the program to crash. Handle them using `try` and `except`:

In [12]:
try:
    print(0/0)
except ZeroDivisionError:
    print("cannot divide by zero")

cannot divide by zero


## Lists

The fundamental data structure in Python is the *list*.

*List*: an ordered collection whose elements can be of just about any type. Like R's lists, Python lists can be nested.

Lists are 0-indexed.

In [13]:
integer_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]
list_of_lists = [integer_list, heterogeneous_list, []] # [] denotes an empty list

list_length = len(integer_list) # equals 3
list_sum = sum(integer_list) # equals 6

Use square brackets to get or set the *n*th element of a list:

In [14]:
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

zero = x[0] # equals 0
print(zero)
one = x[1] # equals 1
print(one)
nine = x[-1] # equals 9, 'Pythonic' for last element
print(nine)
eight = x[-2] # equals 8, 'Pythonic' for next-to-last element
print(eight)
x[0] = -1 # now x is [-1, 1, 2, 3, ..., 9]
print(x)

0
1
9
8
[-1, 1, 2, 3, 4, 5, 6, 7, 8, 9]


* Square brackets also allow us to *slice* lists
* slice *i:j* means all elements from *i* (inclusive) to *j* (not inclusive)

In [15]:
first_three = x[:3] # [-1, 1, 2]
print(first_three)

three_to_end = x[3:] # [3, 4, ..., 9]
print(three_to_end)

one_to_four = x[1:5] # [1, 2, 3, 4]
print(one_to_four)

last_three = x[-3:] # [7, 8, 9]
print(last_three)

without_first_and_last = x[1:-1] # [1, 2, 3, ..., 8]
print(without_first_and_last)

copy_of_x = x[:] # [-1, 1, 2, ...,9]
print(copy_of_x)

[-1, 1, 2]
[3, 4, 5, 6, 7, 8, 9]
[1, 2, 3, 4]
[7, 8, 9]
[1, 2, 3, 4, 5, 6, 7, 8]
[-1, 1, 2, 3, 4, 5, 6, 7, 8, 9]


* You can also slice strings and other "sequential" types

* A slice can take a third argument to indicate its *stride*, which can also be negative
    * Implicitly, the stride argument defaults to 1

In [16]:
every_third = x[::3]
print(every_third)

five_to_three = x[5:2:-1]
print(five_to_three)

[-1, 3, 6, 9]
[5, 4, 3]
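
As a small illustration of slicing other sequence types, the sketch below applies the same slice syntax to a string and a tuple. The names `word` and `point` and their values are made up for this example:

```python
# Slicing works on any sequence type, not just lists.
word = "datascience"            # hypothetical example string
first_four = word[:4]           # "data"
reversed_word = word[::-1]      # a stride of -1 walks the string backwards

point = (10, 20, 30, 40)        # hypothetical example tuple
middle = point[1:3]             # (20, 30), slicing a tuple returns a tuple

print(first_four)
print(reversed_word)
print(middle)
```

Note that slicing returns a new object of the same type as the original: a string slice is a string and a tuple slice is a tuple.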


The *in* operator checks for list membership:

In [17]:
1 in [1,2,3] # True

True

In [18]:
0 in [1,2,3] # False

False

* There are three ways to concatenate lists together
    1. Use `extend` to modify a list in place
    2. Use list addition if you don't want to modify the list
    3. Use `append` to add items to a list one at a time

In [19]:
x = [1,2,3]
x.extend([4,5,6])
x

[1, 2, 3, 4, 5, 6]

In [20]:
x = [1,2,3]
y = x + [4,5,6]
y

[1, 2, 3, 4, 5, 6]

In [21]:
x = [1,2,3]
x.append(0)
y = x[-1]
z = len(x)
print(x)
print(y)
print(z)

[1, 2, 3, 0]
0
4


If you know how many elements a list contains, you can *unpack* the list:

In [22]:
x, y = [1, 2] # now x is 1, y is 2
print(x)
print(y)

1
2
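
Unpacking only works when the number of names on the left matches the number of elements on the right. The sketch below (variable names are illustrative) shows the `ValueError` Python raises when they differ:

```python
# Unpacking three values into two names raises a ValueError.
try:
    a, b = [1, 2, 3]
    unpack_result = "unpacked"
except ValueError:
    unpack_result = "wrong number of elements"

print(unpack_result)
```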


Use an underscore for a value you're going to throw away:

In [23]:
_, y = [1, 2] # now y == 2, didn't care about the first element
y

2

## Tuples

*Tuples*: lists' immutable cousin. Everything you can do to a list except modifying it can be done to a tuple. Tuples are specified using parentheses (or nothing) instead of square brackets:

In [24]:
my_list = [1, 2]
my_tuple = (1, 2)
other_tuple = 3, 4

my_list[1] = 3 # my_list is now [1, 3]

try:
    my_tuple[1] = 3
except TypeError:
    print("cannot modify a tuple")

cannot modify a tuple


Tuples are a convenient way to return multiple values from functions:

In [25]:
def sum_and_product(x, y):
    return (x + y), (x * y)

sp = sum_and_product(2, 3) # sp is (5, 6)
print(sp)

s, p = sum_and_product(5, 10) # s is 15 and p is 50
print(s)
print(p)

(5, 6)
15
50


## Dictionaries

Dictionaries associate *values* with *keys* and allow us to quickly retrieve the value corresponding to a given key:

In [26]:
empty_dict = {} # Pythonic
empty_dict2 = dict() # less Pythonic
grades = {"Joel": 80, "Tim": 95} # dictionary literal
grades

{'Joel': 80, 'Tim': 95}

Values for a key in a dictionary are looked up using square brackets:

In [27]:
joels_grade = grades["Joel"] # equals 80
joels_grade

80

We will get a `KeyError` if we ask for a key that's not in the dictionary:

In [28]:
try:
    kates_grade = grades["Kate"]
except KeyError:
    print("no grade for Kate!")

no grade for Kate!


Check for the existence of a key using `in`:

In [29]:
joel_has_grade = "Joel" in grades # True
joel_has_grade

True

In [30]:
kate_has_grade = "Kate" in grades # False
kate_has_grade

False

Dictionaries have a `get` method that returns a default value (instead of raising an exception) when looking up a key that's not in the dictionary:

In [31]:
joels_grade = grades.get("Joel", 0) # second argument is what you can set as the default value
print(joels_grade)

kates_grade = grades.get("Kate", 0) # Equals 0
print(kates_grade)

no_ones_grade = grades.get("No One") # if second argument is not supplied, the default is None
print(no_ones_grade)

80
0
None


Key/value pairs can be assigned using square brackets:

In [32]:
grades["Tim"] = 99 # replaces the old value
print(grades)

grades["Kate"] = 100 # adds a third entry
print(grades)

num_students = len(grades) # equals 3
print(num_students)

{'Joel': 80, 'Tim': 99}
{'Joel': 80, 'Tim': 99, 'Kate': 100}
3


Dictionaries can be used to represent structured data:

In [33]:
tweet = {
    "user" : "joelgrus",
    "text" : "Data Science is Awesome",
    "retweet_count" : 100,
    "hashtags" : ["#data", "#science", "datascience", "awesome", "yolo"]
}

tweet

{'user': 'joelgrus',
 'text': 'Data Science is Awesome',
 'retweet_count': 100,
 'hashtags': ['#data', '#science', 'datascience', 'awesome', 'yolo']}

In addition to looking for specific keys, we can also look at all of the keys:

In [34]:
tweet_keys = tweet.keys() # iterable for the keys
print(tweet_keys)

tweet_values = tweet.values() # iterable for the values
print(tweet_values)

tweet_items = tweet.items() # iterable for the (key, value) tuples
print(tweet_items)

dict_keys(['user', 'text', 'retweet_count', 'hashtags'])
dict_values(['joelgrus', 'Data Science is Awesome', 100, ['#data', '#science', 'datascience', 'awesome', 'yolo']])
dict_items([('user', 'joelgrus'), ('text', 'Data Science is Awesome'), ('retweet_count', 100), ('hashtags', ['#data', '#science', 'datascience', 'awesome', 'yolo'])])


In [35]:
"user" in tweet_keys # Check for existence of "user". True, but not Pythonic


True

In [36]:
"user" in tweet # Pythonic way of checking for keys

True

In [37]:
"joelgrus" in tweet_values # True (slow but only way to check values)

True

We cannot use lists as keys (keys must be "hashable"). If we need a multipart key, use a tuple or turn the key into a string.
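
A minimal sketch of the hashability rule, using made-up names and grades: a list key raises a `TypeError`, while a tuple holding the same parts works as a multipart key.

```python
# A list is mutable and therefore unhashable, so it cannot be a dict key.
list_key_failed = False
try:
    bad_grades = {["Joel", "Grus"]: 80}
except TypeError:
    list_key_failed = True

# A tuple of the same parts is hashable and works as a multipart key.
grades_by_name = {("Joel", "Grus"): 80}   # hypothetical data
grade = grades_by_name[("Joel", "Grus")]

print(list_key_failed)
print(grade)
```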

### defaultdict

`defaultdict`: Like a regular dictionary, except that when you look up a key it doesn't contain, it first adds a value for that key using a zero-argument function you provided when you created it. Use a `defaultdict` by importing it from `collections`:

In [38]:
from collections import defaultdict

# word_counts = defaultdict(int) # int() produces 0
# for word in document:
#     word_counts[word] += 1
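
The commented-out loop above needs a `document` to run. Here is a runnable sketch of the same idea, assuming `document` is simply a list of words (the words are invented for illustration):

```python
from collections import defaultdict

# Hypothetical document, already split into words.
document = ["data", "science", "is", "data", "driven"]

word_counts = defaultdict(int)   # int() produces 0 for missing keys
for word in document:
    word_counts[word] += 1       # no KeyError the first time a word is seen

print(dict(word_counts))
```

With a plain `dict`, the first `+= 1` for each new word would raise a `KeyError` unless we checked for the key or used `get` first.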

A `defaultdict` can also be created with `list`, `dict`, or your own zero-argument function, such as a lambda:

In [39]:
dd_list = defaultdict(list) # list() produces an empty list
print(dd_list)
dd_list[2].append(1) # now dd_list contains {2: [1]}
print(dd_list)

defaultdict(<class 'list'>, {})
defaultdict(<class 'list'>, {2: [1]})


In [40]:
dd_dict = defaultdict(dict) # dict() produces an empty dict
print(dd_dict)
dd_dict["Joel"]["City"] = "Seattle" # {"Joel" : {"City": "Seattle"}}
print(dd_dict)

defaultdict(<class 'dict'>, {})
defaultdict(<class 'dict'>, {'Joel': {'City': 'Seattle'}})


In [41]:
dd_pair = defaultdict(lambda: [0, 0])
print(dd_pair)
dd_pair[2][1] = 2
print(dd_pair)

defaultdict(<function <lambda> at 0x11138ba60>, {})
defaultdict(<function <lambda> at 0x11138ba60>, {2: [0, 2]})


## Counters

A `Counter` converts a sequence of values into a `defaultdict(int)`-like object mapping keys to counts:

In [42]:
from collections import Counter
c = Counter([0, 1, 2, 0]) # c is {0: 2, 1: 1, 2: 1}
print(c)

Counter({0: 2, 1: 1, 2: 1})


We can also use the `most_common` method from a `Counter` instance:

In [43]:
# print the 2 most common numbers and their counts in c
c = Counter([0, 1, 2, 0, 0, 1, 1, 1, 1, 2])
print(c)

for num, count in c.most_common(2):
    print(num, count)

Counter({1: 5, 0: 3, 2: 2})
1 5
0 3


## Sets



A set is another Python data structure. It represents a collection of *distinct* elements. Define a set by listing its elements between curly braces:

In [44]:
primes_below_10 = {2, 3, 5, 7}

Keep in mind that you can't use `{}` to denote an empty set. Recall from the **Dictionaries** section that `{}` creates an empty `dict`. To create an empty set, we use `set()`:

In [45]:
s = set()
print(s)
s.add(1) # s is now {1}
print(s)
s.add(2) # s is now {1, 2}
print(s)
s.add(2) # s is still {1, 2} because sets represent a collection of distinct elements
print(s)

set()
{1}
{1, 2}
{1, 2}


Sets are used for two main reasons:

   1. `in` is a very fast operation on sets. Thus it is more appropriate to use a set to conduct a membership test than using a list.
   2. Use sets to find the *distinct* items in a collection.

In [46]:
# Using in
# stopwords_list = ["a", "an", "at"] + hundreds_of_other_words + ["yet", "you"]

# "zip" in stopwords_list # False, but have to check every element

# stopwords_set = set(stopwords_list)
# "zip" in stopwords_set # very fast to check
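
The cell above is commented out because `hundreds_of_other_words` is never defined. A runnable sketch with a deliberately short, made-up stopword list looks like this:

```python
# Hypothetical, shortened stopword list (a real one would be much longer).
stopwords_list = ["a", "an", "at", "yet", "you"]
stopwords_set = set(stopwords_list)

in_list = "zip" in stopwords_list   # False, found by scanning every element
in_set = "zip" in stopwords_set     # False, found by a single hash lookup

print(in_list, in_set)
```

Both checks give the same answer. The difference is cost: the list check slows down as the list grows, while the set check stays roughly constant.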

In [47]:
# Finding distinct items
item_list = [1, 2, 3, 1, 2, 3]
print(item_list)
num_items = len(item_list) # 6
item_set = set(item_list) # {1, 2, 3}
print(item_set)
num_distinct_items = len(item_set) # 3
distinct_item_list = list(item_set) # [1, 2, 3]
print(distinct_item_list)

[1, 2, 3, 1, 2, 3]
{1, 2, 3}
[1, 2, 3]


## Control Flow

You can perform an action conditionally using an `if` statement:

In [50]:
if 1 > 2:
    message = "if only 1 were greater than two..."
elif 1 > 3:
    message = "elif stands for 'else if'"
else:
    message = "when all else fails use else (if you want to)"

*Ternary* if-then-else statements can be made on one line as well:

In [51]:
parity = "even" if x % 2 == 0 else "odd"

Python's implementation of `while` loops:

In [52]:
x = 0
while x < 10:
    print(f"{x} is less than 10")
    x += 1

0 is less than 10
1 is less than 10
2 is less than 10
3 is less than 10
4 is less than 10
5 is less than 10
6 is less than 10
7 is less than 10
8 is less than 10
9 is less than 10


But `for` and `in` are more commonly used:

In [53]:
# range(10) is the numbers 0, 1, 2,..., 9
for x in range(10):
    print(f"{x} is less than 10")

0 is less than 10
1 is less than 10
2 is less than 10
3 is less than 10
4 is less than 10
5 is less than 10
6 is less than 10
7 is less than 10
8 is less than 10
9 is less than 10


More complex logic will use `continue` and `break`:

In [54]:
for x in range(10):
    if x == 3:
        continue # go immediately to the next iteration
    if x == 5:
        break # quit the loop entirely
    print(x)

0
1
2
4


## Truthiness

Booleans are capitalized:

In [55]:
one_is_less_than_two = 1 < 2 # equals True
print(one_is_less_than_two)
true_equals_false = True == False # equals False
print(true_equals_false)

True
False


In R, `NA` is used to indicate a nonexistent value. In Python, the equivalent is the value `None`:

In [60]:
x = None
assert x == None, "this is not the Pythonic way to check for None"
assert x is None, "this is the Pythonic way to check for None"

In [62]:
x is None # prints True

True

Python lets you use any value where it expects a Boolean. The following are treated as False:

   * `False`
   * `None`
   * `[]` (an empty `list`)
   * `{}` (an empty `dict`)
   * `""` (an empty string)
   * `set()` (an empty `set`)
   * `0`
   * `0.0`

This allows us to easily use `if` statements to test for empty strings, empty dictionaries, etc. It also sometimes causes bugs.
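
As a hedged illustration of both sides of this, the sketch below uses three hypothetical helper functions (not from the original notes): one uses truthiness to guard against an empty string, one shows how `or` can silently discard a legitimate `0`, and one avoids that by checking for `None` explicitly.

```python
def first_char(s):
    """Return the first character of s, or "" if s is empty."""
    return s[0] if s else ""

def value_or_default(x, default=10):
    # Pitfall: `x or default` treats a legitimate 0 as if it were missing.
    return x or default

def value_or_default_safe(x, default=10):
    # Safer: only fall back to the default when x is actually None.
    return default if x is None else x

print(first_char(""))              # prints an empty line
print(value_or_default(0))         # 10, probably not what was intended
print(value_or_default_safe(0))    # 0
```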

Python has an `all` function which takes an iterable and returns `True` when every element is truthy.
Python has an `any` function, which returns `True` when at least one element is truthy:

In [69]:
all([True, 1, 3]) # True, all are truthy

True

In [70]:
all([True, 1, {}]) # False, {} is falsy

False

In [71]:
any([True, 1, {}]) # True, True is truthy

True

In [75]:
all([]) # True, no falsy elements in the list

True

In [76]:
all([[],[],[]]) # False, list of empty lists. The empty lists are falsy

False

In [77]:
any([]) # False, no truthy elements in the list

False

## Sorting

Every Python list has a `sort` method that sorts it in place. If you don't want to mess up your list, you can use the `sorted` function, which returns a new list:

In [78]:
x = [4, 1, 2, 3]
y = sorted(x) # y is [1, 2, 3, 4], x is unchanged
print(x)
print(y)

[4, 1, 2, 3]
[1, 2, 3, 4]


In [79]:
x.sort() # now x is [1, 2, 3, 4]
print(x)

[1, 2, 3, 4]


As we can see, `sort` and `sorted` order a list from smallest to largest by default. Pass `reverse=True` to sort from largest to smallest. We can also use the `key` parameter to sort by the result of a function applied to each element:

In [81]:
# sort the list by absolute value from largest to smallest
x = sorted([-4, 1, -2, 3], key=abs, reverse=True) # is [-4, 3, -2, 1]
print(x)

[-4, 3, -2, 1]
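
The `key` function can also be a lambda. As a sketch, the example below sorts made-up word counts from highest to lowest by using each pair's second element as the sort key:

```python
from collections import Counter

# Hypothetical word counts.
word_counts = Counter(["data", "science", "data", "python", "data", "python"])

# Each item is a (word, count) pair; sort on the count, largest first.
by_count = sorted(word_counts.items(),
                  key=lambda word_and_count: word_and_count[1],
                  reverse=True)
print(by_count)
```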


## List Comprehensions

We use *list comprehensions* to transform a list into another list. These transformations can be choosing specific elements, transforming elements, or both.

In [84]:
even_numbers = [x for x in range(5) if x % 2 == 0] # [0, 2, 4]
print(even_numbers)

squares = [x * x for x in range(5)] # [0, 1, 4, 9, 16]
print(squares)

even_squares = [x * x for x in even_numbers] # [0, 4, 16]
print(even_squares)

[0, 2, 4]
[0, 1, 4, 9, 16]
[0, 4, 16]


Similarly, we can turn lists into dictionaries or sets:

In [87]:
square_dict = {x: x * x for x in range(5)} # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
print(square_dict)

square_set = {x * x for x in [1, -1]} # {1}
print(square_set)

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
{1}


Use an underscore as the variable if you don't need the value from the list:

In [88]:
zeros = [0 for _ in even_numbers] # has the same length as even_numbers
print(zeros)

[0, 0, 0]


A list comprehension can include multiple `fors`:

In [89]:
pairs = [(x, y)
        for x in range(10)
        for y in range(10)] # 100 pairs (0,0), (0,1) ... (9,8), (9,9)
print(pairs)

[(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (0, 9), (1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (1, 8), (1, 9), (2, 0), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (2, 7), (2, 8), (2, 9), (3, 0), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6), (3, 7), (3, 8), (3, 9), (4, 0), (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6), (4, 7), (4, 8), (4, 9), (5, 0), (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6), (5, 7), (5, 8), (5, 9), (6, 0), (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6), (6, 7), (6, 8), (6, 9), (7, 0), (7, 1), (7, 2), (7, 3), (7, 4), (7, 5), (7, 6), (7, 7), (7, 8), (7, 9), (8, 0), (8, 1), (8, 2), (8, 3), (8, 4), (8, 5), (8, 6), (8, 7), (8, 8), (8, 9), (9, 0), (9, 1), (9, 2), (9, 3), (9, 4), (9, 5), (9, 6), (9, 7), (9, 8), (9, 9)]


Later `fors` can use the results of earlier ones:

In [90]:
increasing_pairs = [(x, y)
                   for x in range(10)
                   for y in range(x + 1, 10)]
print(increasing_pairs)

[(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (0, 9), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (1, 8), (1, 9), (2, 3), (2, 4), (2, 5), (2, 6), (2, 7), (2, 8), (2, 9), (3, 4), (3, 5), (3, 6), (3, 7), (3, 8), (3, 9), (4, 5), (4, 6), (4, 7), (4, 8), (4, 9), (5, 6), (5, 7), (5, 8), (5, 9), (6, 7), (6, 8), (6, 9), (7, 8), (7, 9), (8, 9)]


## Automated Testing and `assert`

We can increase our confidence that our code is correct by using *types*, *automated tests*, or both.

We will test our code using `assert` statements, which will cause our code to raise an `AssertionError` if the specified condition is not truthy:

In [92]:
assert 1 + 1 == 2
assert 1 + 1 == 2, "1 + 1 should equal 2 but didn't"

The optional message in the `assert` statement will be printed if the assertion fails. We should assert that functions we write are doing what we expect them to do:

In [94]:
def smallest_item(xs):
    return min(xs)

assert smallest_item([10, 20, 5, 40]) == 5
assert smallest_item([1, 0, -1, 2]) == -1

We can also assert things about inputs to functions:

In [96]:
def smallest_item(xs):
    assert xs, "empty list has no smallest item"
    return min(xs)
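
To see the input check in action, the sketch below calls `smallest_item` with an empty list and catches the resulting `AssertionError`. The `failure_message` variable is introduced here only for illustration:

```python
def smallest_item(xs):
    assert xs, "empty list has no smallest item"
    return min(xs)

failure_message = None
try:
    smallest_item([])              # an empty list is falsy, so the assert fails
except AssertionError as error:
    failure_message = str(error)   # the optional message from the assert

print(failure_message)
```

Keep in mind that `assert` statements are skipped when Python runs with the `-O` flag, so they are better suited to catching programming mistakes than to validating user input.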

## Object-Oriented Programming

