# Chapter 3) Built-in Data Structures, Functions, and Files

This chapter covers Python's built-in data functionality. Libraries like pandas and NumPy build on the material in this chapter. We will start with the common data structures:

* tuples
* lists
* dicts
* sets

Custom functions are covered next, followed by reading and writing files on the local disk.

## 3.1) Data Structures and Sequences

### Tuple

A tuple is a fixed-length sequence of Python objects. Tuples are immutable. Tuples can be created with a comma-separated sequence of values:

In [1]:
tup = 1, 2, 3
tup

(1, 2, 3)

When defining tuples inside more complicated expressions, it is often necessary to enclose the values in parentheses, as in this tuple of tuples:

In [2]:
nested_tup = (1, 2, 3), (4, 5)
nested_tup

((1, 2, 3), (4, 5))

You can also convert any sequence of objects into a tuple by using the `tuple()` function.
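
As a small illustration of the `tuple()` conversion (the sample values and variable names here are made up for this sketch), any iterable can be converted, including a list or a string:

```python
# tuple() accepts any iterable, not only lists
from_list = tuple([4, 0, 2])
from_string = tuple('string')

print(from_list)    # (4, 0, 2)
print(from_string)  # ('s', 't', 'r', 'i', 'n', 'g')
```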

Elements of a tuple can be accessed using brackets (`[]`). Note that elements are indexed from zero in Python.

In [6]:
nested_tup[1]

(4, 5)

Note that tuples themselves are immutable, but if an element of a tuple is mutable then it can be mutated in place:

In [7]:
tup = tuple(['foo', [1, 2], True])
tup[2] = False

TypeError: 'tuple' object does not support item assignment

In [10]:
tup[1].append(3)
tup

('foo', [1, 2, 3], True)

Tuples can be concatenated like strings using the `+` operator:

In [11]:
(1, 2, 3) + (4, 5)

(1, 2, 3, 4, 5)

Multiplying a tuple by an integer `x` concatenates `x` copies of the tuple. Note that this is *very* different from R's vectorized arithmetic.

In [13]:
(1, 2, 3) * 2

(1, 2, 3, 1, 2, 3)

#### Unpacking tuples

Python will try to unpack tuples when they are assigned to a tuple-like expression of variables:

In [17]:
tup = 1, 2, 3
a, b, c = tup
a

1

A common use of variable unpacking is when iterating over a sequence of tuples/lists:

In [20]:
seq = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
for a, b, c in seq:
    print('a={0}, b={1}, c={2}'.format(a, b, c)) # note the use of format

a=1, b=2, c=3
a=4, b=5, c=6
a=7, b=8, c=9
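
Two other uses of unpacking are worth a quick sketch: swapping names without a temporary variable, and collecting "the rest" of a sequence with a starred name. The variable names below are illustrative only.

```python
values = 1, 2, 3, 4, 5

# Swap two names without a temporary variable
a, b = 1, 2
b, a = a, b

# Capture the first two items and collect the remainder in a list
first, second, *rest = values

print(a, b)                 # 2 1
print(first, second, rest)  # 1 2 [3, 4, 5]
```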


#### Tuple methods

Since tuples are immutable, there are few methods available. A useful one is `count()`, which counts the occurrences of a value:

In [21]:
tup = (1, 2, 3, 4, 5, 5, 5)
tup.count(5)

3

### List

Lists are also sequences of Python objects, but they are variable-length and can be modified in-place. They can be defined with `[]` or `list()`:

In [24]:
lst_a = [1, 2, 3]
lst_b = list((1, 2, 3))
lst_a == lst_b

True

Lists and tuples are semantically similar. The `list()` function is commonly used to materialize a lazy object, such as an iterator or a `range`, into a list:

In [27]:
gen = range(10)
gen

range(0, 10)

In [26]:
list(gen)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

#### Adding and removing elements

The `append()` method can be used to add elements to the end of a list:

In [29]:
lst = ['a', 'b', 'c']
lst.append('d')
lst

['a', 'b', 'c', 'd']

The `insert()` method can be used to insert an element at a particular position in a list:

In [30]:
lst = ['a', 'c', 'd']
lst.insert(1, 'b')
lst

['a', 'b', 'c', 'd']

Elements of a list can be removed with `remove()`. Note that only the first occurrence of the value will be removed:

In [31]:
lst = ['a', 'b', 'c', 'b']
lst.remove('b')
lst

['a', 'c', 'b']

You can check if a list contains a value using `in`:

In [32]:
'a' in lst

True

In [33]:
'a' not in lst

False

Note that `in` is slow with lists relative to dicts and sets, which will be introduced soon.
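
The difference can be measured with the standard library's `timeit` module. This is only a rough sketch: the collection size and repeat count are arbitrary choices, and timings vary by machine, so no output is shown.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
target = n - 1  # worst case for the list: the last element

# A list is scanned element by element; a set uses a hash lookup
list_time = timeit.timeit(lambda: target in as_list, number=100)
set_time = timeit.timeit(lambda: target in as_set, number=100)

print('list: {0:.5f}s, set: {1:.5f}s'.format(list_time, set_time))
```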

#### Combining and concatenating lists

Lists can also be concatenated with the `+` operator. The `extend()` method can be used to append multiple elements at once.

In [36]:
['a', 'b', 'c'] + ['d', 'e']

['a', 'b', 'c', 'd', 'e']

In [35]:
lst = ['a', 'b', 'c']
lst.extend(['d', 'e'])
lst

['a', 'b', 'c', 'd', 'e']

Note that `extend()` and `append()` are usually faster than `+` because they are modifying the list in place rather than writing a completely new list.
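
The in-place versus new-list distinction can be checked directly. In this sketch, `alias` is just a second name for the same list object, introduced for illustration:

```python
original = ['a', 'b', 'c']
alias = original                  # second name for the same list object

combined = original + ['d', 'e']  # builds a brand-new list
print(combined is original)       # False
print(alias)                      # ['a', 'b', 'c']

original.extend(['d', 'e'])       # grows the existing list in place
print(alias)                      # ['a', 'b', 'c', 'd', 'e']
```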

#### Sorting

The `sort()` method can be used to sort a list in place:

In [38]:
lst = [1, 3, 2]
lst.sort()
lst

[1, 2, 3]
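
`sort()` also accepts optional `key` and `reverse` arguments. A minimal sketch with made-up sample data:

```python
words = ['saw', 'small', 'He', 'foxes', 'six']

# key is called once per element; elements are ordered by its result
words.sort(key=len)
print(words)    # ['He', 'saw', 'six', 'small', 'foxes']

# reverse=True sorts in descending order
numbers = [1, 3, 2]
numbers.sort(reverse=True)
print(numbers)  # [3, 2, 1]
```

Because the sort is stable, words of equal length keep their original relative order.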

#### Binary search and maintaining a sorted list

The `bisect` module implements binary search and insertion into a sorted list. `bisect.bisect()` finds the location where an element would be inserted to keep the list sorted, while `bisect.insort()` actually inserts the element at that location.

In [40]:
import bisect
lst = [1, 2, 3, 4, 5, 7]

In [43]:
bisect.bisect(lst, 6)

5

In [44]:
lst

[1, 2, 3, 4, 5, 7]

In [45]:
bisect.insort(lst, 6)

In [46]:
lst

[1, 2, 3, 4, 5, 6, 7]

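Note that `insort()` modifies the list in place. The `bisect` functions do not check whether the list is actually sorted, so using them on an unsorted list runs without error but can give incorrect results.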
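
When the list contains duplicates, the insertion point depends on which function is used. The following sketch (sample values chosen for illustration) compares `bisect.bisect()` with `bisect.bisect_left()`:

```python
import bisect

sorted_values = [1, 2, 2, 2, 3, 4, 7]

# bisect.bisect() is an alias of bisect_right: index after any equal items
right = bisect.bisect(sorted_values, 2)
# bisect_left: index before any equal items
left = bisect.bisect_left(sorted_values, 2)
print(right, left)    # 4 1

bisect.insort(sorted_values, 6)
print(sorted_values)  # [1, 2, 2, 2, 3, 4, 6, 7]
```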

#### Slicing

Most sequence types can be subsetted using the "slicing" notation. For example:

In [47]:
lst = ['a', 'b', 'c', 'd', 'e']
lst[2:3]

['c']

Note that the element at the start index is included, but the element at the stop index is excluded, so `lst[2:3]` contains a single element. This is very different from R.
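
A few more slicing forms, sketched on the same kind of list (the name `letters` is illustrative):

```python
letters = ['a', 'b', 'c', 'd', 'e']

print(letters[:2])    # ['a', 'b']  (start omitted: from the beginning)
print(letters[3:])    # ['d', 'e']  (stop omitted: through the end)
print(letters[-2:])   # ['d', 'e']  (negative indices count from the end)
print(letters[::2])   # ['a', 'c', 'e']  (a step of 2 takes every other element)
print(letters[::-1])  # ['e', 'd', 'c', 'b', 'a']  (a step of -1 reverses)

# Assigning to a slice replaces that span, even with a different length
letters[1:3] = ['x']
print(letters)        # ['a', 'x', 'd', 'e']
```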

### Built-in sequence functions

#### enumerate

When iterating over a sequence, it is often useful to keep track of the index of the current item. This could be done with:

In [51]:
i = 0
for x in [1, 2, 3]:
    print('Current index is ' + str(i))
    i = i + 1

Current index is 0
Current index is 1
Current index is 2


But because this is such a common task, we can also do:

In [50]:
for i, x in enumerate([1, 2, 3]):
    print('Current index is ' + str(i))

Current index is 0
Current index is 1
Current index is 2
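
A common use of `enumerate()` is building a `dict` that maps each value in a sequence to its position. A minimal sketch, with hypothetical names `some_list` and `mapping`:

```python
some_list = ['foo', 'bar', 'baz']

mapping = {}
for index, value in enumerate(some_list):
    mapping[value] = index

print(mapping)  # {'foo': 0, 'bar': 1, 'baz': 2}
```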


#### sorted

The `sorted()` function can be used to return a sorted list from the elements of any sequence:

In [53]:
lst = [2, 3, 5, 4, 1]
sorted(lst)

[1, 2, 3, 4, 5]

In [54]:
tup = (2, 3, 5, 4, 1)
sorted(tup)

[1, 2, 3, 4, 5]

#### zip

The `zip()` function pairs up the elements of multiple sequences into tuples. It returns an iterator, so wrap it in `list()` to see the result:

In [63]:
seq_1 = ['foo', 'bar', 'zoo']
seq_2 = [1, 2, 3]
list(zip(seq_1, seq_2))

[('foo', 1), ('bar', 2), ('zoo', 3)]
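
Two further behaviors of `zip()` are sketched below with made-up data: it stops at the shortest input, and combined with `*` it can "unzip" a list of rows into columns.

```python
seq_1 = ['foo', 'bar', 'zoo']
seq_2 = [1, 2, 3]
seq_3 = [False, True]

# zip stops at the shortest input, so 'zoo' and 3 are dropped
zipped = list(zip(seq_1, seq_2, seq_3))
print(zipped)   # [('foo', 1, False), ('bar', 2, True)]

# zip(*pairs) converts a list of rows into a tuple per column
pairs = [('foo', 1), ('bar', 2), ('zoo', 3)]
names, numbers = zip(*pairs)
print(names)    # ('foo', 'bar', 'zoo')
print(numbers)  # (1, 2, 3)
```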

#### reversed

`reversed()` returns an iterator, so, like `zip()`, it produces no values until it is consumed (for example with `list()` or a `for` loop). It iterates over the elements of a sequence in reverse order.

In [65]:
list(range(5))

[0, 1, 2, 3, 4]

In [66]:
list(reversed(range(5)))

[4, 3, 2, 1, 0]

### dict

`dict` is arguably the most important built-in data structure. In other languages it is commonly known as a *hash map* or *associative array*. A `dict` is a mutable, flexibly sized collection of key-value pairs.

A common way of creating a dict is the use of curly braces and colons:

In [86]:
dict_1 = {'a' : 'apple', 'b' : 'banana', 'c' : 'carrot'}
dict_1

{'a': 'apple', 'b': 'banana', 'c': 'carrot'}

Note that the value of each key-value pair does not need to be of the same type:

In [68]:
dict_2 = {'a' : 'apple', 'b' : [1, 2, 3, 4]}
dict_2

{'a': 'apple', 'b': [1, 2, 3, 4]}

Elements are accessed, inserted, or set with the same bracket syntax used for `lists` and `tuples`:

In [87]:
dict_1['d'] = 'doritos'
dict_1

{'a': 'apple', 'b': 'banana', 'c': 'carrot', 'd': 'doritos'}

In [88]:
dict_1['b']

'banana'

You can check if a `dict` contains a key by using the `in` operator:

In [72]:
'c' in dict_1

True

In [73]:
'c' in dict_2

False

You can delete entries with either the `del` keyword or the `pop()` method, which also returns the removed value:

In [89]:
del dict_1['d']
dict_1

{'a': 'apple', 'b': 'banana', 'c': 'carrot'}

In [90]:
dict_1.pop('a')
dict_1

{'b': 'banana', 'c': 'carrot'}

The `keys()` and `values()` methods return iterable views of the `dict` keys and values:

In [83]:
list(dict_1.keys())

['b', 'c']

In [84]:
list(dict_1.values())

['banana', 'carrot']

`dicts` can be merged into one another using the `update()` method:

In [91]:
dict_1.update({'d' : 'doritos', 'e' : 'eggs'})
dict_1

{'b': 'banana', 'c': 'carrot', 'd': 'doritos', 'e': 'eggs'}
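
Two details are easy to miss: `update()` overwrites the values of keys that already exist, and `pop()` returns the value it removes. The sketch below also uses `get()`, a standard `dict` method not covered above, which returns a default instead of raising `KeyError`. The name `inventory` is made up for illustration.

```python
inventory = {'b': 'banana', 'c': 'carrot'}

# update() changes the dict in place; values for existing keys are overwritten
inventory.update({'c': 'cherry', 'd': 'doritos'})
print(inventory)  # {'b': 'banana', 'c': 'cherry', 'd': 'doritos'}

# get() returns a default instead of raising KeyError for a missing key
fallback = inventory.get('z', 'missing')
print(fallback)   # missing

# pop() removes a key and returns its value
removed = inventory.pop('b')
print(removed)    # banana
```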

#### Creating dicts from sequences

A `dict` is essentially a collection of 2-tuples, each holding a key and a value. The `dict()` function therefore accepts any sequence of 2-tuples, such as the pairs produced by zipping a sequence of keys with a sequence of values:

In [95]:
dict_1 = dict(zip((1, 2, 3), ('a', 'b', 'c')))
dict_1

{1: 'a', 2: 'b', 3: 'c'}

### set

A `set` is an unordered collection of unique elements. `sets` can be created using the `set()` function or via *set literal* with curly braces:

In [96]:
set_1 = set([1, 2, 3])
set_2 = {1, 2, 3}
set_1 == set_2

True

`sets` support mathematical set operations:

In [97]:
a = {1, 2, 3, 4, 5}
b = {1, 3, 5, 7, 9}

In [98]:
a.union(b)

{1, 2, 3, 4, 5, 7, 9}

In [99]:
a.intersection(b)

{1, 3, 5}
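
The same operations are available as binary operators, along with difference and symmetric difference. In this sketch the results are passed through `sorted()` only to make the printed order predictable, since sets are unordered.

```python
a = {1, 2, 3, 4, 5}
b = {1, 3, 5, 7, 9}

print(sorted(a | b))  # union: [1, 2, 3, 4, 5, 7, 9]
print(sorted(a & b))  # intersection: [1, 3, 5]
print(sorted(a - b))  # difference (in a but not b): [2, 4]
print(sorted(a ^ b))  # symmetric difference (in exactly one set): [2, 4, 7, 9]

print({1, 3}.issubset(a))  # True
```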

### list, set, and dict comprehensions

*List comprehensions* are one of the best-loved Python features. They allow you to concisely form a new list by filtering the elements of a collection and transforming the elements that pass the filter, all in one expression. The general form is:

```python
[expr for val in collection if condition]
```

which is equivalent to:

```python
result = []
for val in collection:
    if condition:
        result.append(expr)
```

Here is an example:

In [100]:
strings = ['a', 'banana', 'cat', 'doritos', 'egg']

In [103]:
[x.upper() for x in strings if len(x) == 3]

['CAT', 'EGG']

Note that only the elements meeting the condition are passed through the list comprehension.

`set` and `dict` comprehensions are a natural extension. Here is a dict comprehension:

```python
{key_expr: value_expr for value in collection if condition}
```

And a set comprehension:

```python
{expr for value in collection if condition}
```
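
To make the two general forms concrete, here is a minimal sketch that reuses the list of strings from the earlier example (the names `positions` and `unique_lengths` are illustrative):

```python
strings = ['a', 'banana', 'cat', 'doritos', 'egg']

# dict comprehension: map each string to its position in the list
positions = {value: index for index, value in enumerate(strings)}
print(positions)  # {'a': 0, 'banana': 1, 'cat': 2, 'doritos': 3, 'egg': 4}

# set comprehension: the distinct string lengths
unique_lengths = {len(x) for x in strings}
print(sorted(unique_lengths))  # [1, 3, 6, 7]
```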

#### Nested list comprehensions

Nested comprehensions can be helpful when working with nested objects:

In [111]:
all_data = [['John', 'Emily', 'Michael', 'Mary', 'Steven'],
            ['Maria', 'Juan', 'Javier', 'Natalia', 'Pilar']]

Imagine we want to find all the names with at least one 'e'. We could do this with a nested for loop:

In [113]:
names_with_e = []
for x in all_data:
    for y in x:
        if y.count('e') + y.count('E') >= 1:
            names_with_e.append(y)
names_with_e

['Emily', 'Michael', 'Steven', 'Javier']

Or, we could use a list comprehension inside a for loop:

In [114]:
names_with_e = []
for x in all_data:
    sub_names_with_e = [y for y in x if y.count('e') + y.count('E') >= 1]
    names_with_e.extend(sub_names_with_e)
names_with_e

['Emily', 'Michael', 'Steven', 'Javier']

Or, even better, we can do a *nested comprehension*:

In [116]:
[y for x in all_data for y in x if y.count('e') + y.count('E') >= 1]

['Emily', 'Michael', 'Steven', 'Javier']

## 3.2) Functions

Functions are extremely important for code organization and reuse in Python. If you are going to use the same code more than once, it's worth writing a function for it. Functions also make code more readable.

Functions are declared with the `def` keyword and results are returned with the `return` keyword:

```python
def my_function(x, y):
    if x > y:
        return 'X is greater'
    else:
        return 'X is not greater'
```

If a function reaches its end without executing a `return` statement, it returns `None`.
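
The implicit `None` can be seen in a function that returns a value on only one branch. The function name `describe` is hypothetical:

```python
def describe(x, y):
    if x > y:
        return 'X is greater'
    # no return statement is reached when x <= y

print(describe(2, 1))  # X is greater
print(describe(1, 2))  # None
```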

### Namespaces, Scope, and Local Functions

Any variable assigned within a function is assigned to the *local* namespace by default. This namespace is specific to the function: it is created when the function is called, its objects are not available outside of the function, and it disappears when the function completes.
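
A short sketch of this behavior, using the hypothetical names `build_list` and `local_values`: the list exists only while the function runs, and only the returned object survives.

```python
def build_list():
    local_values = []          # lives in the function's local namespace
    for i in range(3):
        local_values.append(i)
    return local_values

result = build_list()
print(result)                        # [0, 1, 2]
print('local_values' in globals())   # False
```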

It is possible to define variables to the *global* namespace within a function, but the `global` keyword must be used:

In [124]:
def func():
    global a
    a = ['apple', 'banana']

In [126]:
func()

In [127]:
a

['apple', 'banana']

### Returning Multiple Values

Multiple values can be returned from a function in Python:

In [129]:
def func(x):
    return x, x + 1, x + 2
a, b, c = func(1)

In actuality, `func()` is returning a single tuple that is then being unpacked.
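
This can be verified by capturing the result in a single name. The sketch also shows one possible alternative, returning a `dict` when named fields are clearer; `func_dict` is a hypothetical name.

```python
def func(x):
    return x, x + 1, x + 2

result = func(1)
print(type(result).__name__, result)  # tuple (1, 2, 3)

# Alternative: return a dict so callers can refer to values by name
def func_dict(x):
    return {'a': x, 'b': x + 1, 'c': x + 2}

print(func_dict(1)['b'])  # 2
```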

### Functions Are Objects

Python functions being objects brings ease to many tasks. For example, imagine we have the following list:

In [130]:
states = ['    Alabama ', 'Georgia!', 'Georgia', 'georgia', 
          'FLorida', 'south     carolina##', 'West virginia?']

A lot of things need to happen to clean this list of states:

1. Stripping whitespace
2. Removing punctuation
3. Proper capitalization

We can create a function to do this:

In [131]:
import re

def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip()
        value = re.sub('[!#?]', '', value)
        value = value.title()
        result.append(value)
    return result

When run, the result will look like this:

In [132]:
clean_strings(states)

['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South     Carolina',
 'West Virginia']

However, an alternative approach would be to make a list of the functions. We can do this because functions are objects:

In [133]:
def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings(strings, ops):
    result = []
    for value in strings:
        for function in ops:
            value = function(value)
        result.append(value)
    return result

In [135]:
clean_strings(states, clean_ops)

['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South     Carolina',
 'West Virginia']

This *functional* approach is generic and reusable. Using functions as arguments to other functions is popular in Python, and is the basis for the `map()` function:

In [137]:
for x in map(remove_punctuation, states):
    print(x)

    Alabama 
Georgia
Georgia
georgia
FLorida
south     carolina
West virginia
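
Since `map()` is lazy, it is often wrapped in `list()`, and the same transformation can be written as a list comprehension. The sketch below redefines `remove_punctuation` and uses a shortened list of states so that it runs on its own:

```python
import re

def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

states = ['Georgia!', 'south     carolina##', 'West virginia?']

# map() is lazy: wrap it in list() to materialize the results
mapped = list(map(remove_punctuation, states))

# The same transformation as a list comprehension
comprehended = [remove_punctuation(x) for x in states]

print(mapped == comprehended)  # True
print(mapped)  # ['Georgia', 'south     carolina', 'West virginia']
```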


### Anonymous (Lambda) Functions

*Anonymous* or *lambda* functions consist of a single expression, the result of which is the return value. They are defined with the `lambda` keyword, which simply means "we are defining an anonymous function." The following two functions are equivalent:

In [140]:
def short_function(x):
    return x * 2

equiv_anon = lambda x: x * 2

In [144]:
equiv_anon(2) == short_function(2)

True

Lambda functions can be useful for data analysis, because many data transformation functions take functions as arguments. Here is an example that sorts a list by the number of characters in each element:

In [147]:
strings = ['apple', 'banana', 'clementine', 'doritos', 'egg', 'falafel']
strings.sort(key = lambda x: len(x))
strings

['egg', 'apple', 'banana', 'doritos', 'falafel', 'clementine']

### Currying: Partial Argument Application

*Currying* is the derivation of new functions from existing ones using partial argument application. Here is an example:

In [148]:
def add_numbers(x, y):
    return x + y

add_five = lambda y: add_numbers(5, y)

add_five(10)

15

In this case, the second argument to `add_numbers()` is said to be *curried*. There is nothing special going on here: we have simply defined a new function that calls an existing function. The standard library's `functools` module has a `partial()` function to simplify this:

In [149]:
from functools import partial

add_five = partial(add_numbers, 5)

add_five(10)

15
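
`partial()` can also fix arguments by keyword, which helps when the argument to pin is not the first one. A minimal sketch with a hypothetical `power` function:

```python
from functools import partial

def power(base, exponent):
    return base ** exponent

# Fix the exponent by keyword; the remaining argument is supplied later
square = partial(power, exponent=2)
cube = partial(power, exponent=3)

print(square(4))  # 16
print(cube(2))    # 8

# A partial object remembers the wrapped function and the fixed arguments
print(square.func is power, square.keywords)  # True {'exponent': 2}
```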

### Generators

An important part of Python is the *iterator protocol*, which is a generic way to make objects iterable. For example, when you write `for key in some_dict:`, the Python interpreter first attempts to create an iterator out of `some_dict`:

```python
iter(some_dict)
```

An *iterator* is any Python object that will yield objects when used in a context like a `for` loop. A *generator* is a concise way to construct an iterator object. Whereas a normal function executes and returns a single result at once, a generator returns a sequence of results lazily: an element is not created until it's requested. To create a generator, use the `yield` keyword instead of `return` in a function:

In [163]:
def squares(n = 10):
    print('Generator squares from 1 to {0}'.format(n ** 2))
    for i in range(1, n + 1):
        yield i ** 2

In [164]:
gen = squares()
gen

<generator object squares at 0x7f165c4eae60>

In [165]:
for x in gen:
    print(x, end = ' ')

Generator squares from 1 to 100
1 4 9 16 25 36 49 64 81 100 

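`range()` behaves similarly in that it produces its values lazily, although strictly speaking a `range` object is a lazy sequence, not a generator.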
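
The lazy, one-pass nature of a generator can be observed by pulling values one at a time with the built-in `next()`. This sketch uses a simplified version of `squares()` without the `print` call:

```python
def squares(n=10):
    for i in range(1, n + 1):
        yield i ** 2

gen = squares(3)
first = next(gen)      # runs the body up to the first yield
second = next(gen)
remaining = list(gen)  # only the values not yet consumed
leftover = list(gen)   # a generator can be consumed only once

print(first, second)   # 1 4
print(remaining)       # [9]
print(leftover)        # []
```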

#### Generator expressions

Generator expressions are similar to list, set, and dict comprehensions:

```python
gen = (x ** 2 for x in range(100))
```

is equivalent to:

```python
def _make_gen():
    for x in range(100):
        yield x ** 2
gen = _make_gen()
```
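
Generator expressions can be passed directly as function arguments in place of a list comprehension, which avoids building an intermediate list. A short sketch:

```python
# The generator expression is consumed by sum() one value at a time
total = sum(x ** 2 for x in range(100))
print(total)         # 328350

# dict() accepts a generator of (key, value) pairs
squares_by_n = dict((i, i ** 2) for i in range(5))
print(squares_by_n)  # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
```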

#### itertools module

The `itertools` module has many generators for common data algorithms. For example, `groupby()` takes any sequence and a function, grouping consecutive elements in the sequence by return value of the function. Here is an example:

In [167]:
from itertools import groupby

first_letter = lambda x: x[0]

names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']

for letter, group in groupby(names, first_letter):
    print(letter, list(group))

A ['Alan', 'Adam']
W ['Wes', 'Will']
A ['Albert']
S ['Steven']
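
Because `groupby()` only groups *consecutive* elements, `'A'` appears twice in the output above. One common approach, sketched here, is to sort by the same key first so that all equal keys are adjacent (the name `grouped` is illustrative):

```python
from itertools import groupby

names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']
first_letter = lambda x: x[0]

# Sorting by the same key first puts all equal keys next to each other
ordered = sorted(names, key=first_letter)
grouped = {letter: list(group)
           for letter, group in groupby(ordered, first_letter)}

print(grouped)
# {'A': ['Alan', 'Adam', 'Albert'], 'S': ['Steven'], 'W': ['Wes', 'Will']}
```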


### Errors and Exception Handling

Handling errors and exceptions gracefully is an important part of building robust programs. Normally, the `float()` function will convert a string to a float, assuming the string is in the proper format:

In [172]:
float('1.234')

1.234

In [173]:
float('1.2.3.4')

ValueError: could not convert string to float: '1.2.3.4'

If we wanted to fail gracefully, we could write our own function:

In [174]:
def attempt_float(x):
    try:
        return float(x)
    except:
        return x

In [175]:
attempt_float('1.234')

1.234

In [176]:
attempt_float('1.2.3.4')

'1.2.3.4'

We can restrict the handler to a specific exception type, so that unexpected errors still surface:

In [177]:
def attempt_float(x):
    try:
        return float(x)
    except ValueError:
        return x

In [178]:
attempt_float('1.2.3.4')

'1.2.3.4'

In [179]:
attempt_float((1, 2, 3, 4))

TypeError: float() argument must be a string or a number, not 'tuple'
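
To handle both kinds of failure, an `except` clause can name a tuple of exception types. A minimal sketch that extends `attempt_float()`:

```python
def attempt_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        # Either exception type falls back to returning the input unchanged
        return x

print(attempt_float('1.234'))       # 1.234
print(attempt_float('1.2.3.4'))     # 1.2.3.4
print(attempt_float((1, 2, 3, 4)))  # (1, 2, 3, 4)
```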

#### Exceptions in IPython

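If an exception is raised in IPython, a traceback is printed with a few lines of code context around each point in the call stack, which helps determine where the error occurred and why. The standard Python interpreter also prints a traceback, but without this additional context.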

## 3.3) Files and the Operating System

While `pandas` has its own high-level file-reading functions, Python's built-in file handling is simple and convenient, which is one reason the language is popular for text and file munging.

To open a file for reading/writing, use the `open()` function:

In [180]:
path = 'hello_world.txt'
file = open(path)

By default, the file is opened in read-only mode (`'r'`).

You can iterate over the lines of the open file:

In [184]:
lines = [x.rstrip() for x in file]

In [185]:
lines

['Hello World!',
 '',
 'My name is Mark. What is your name?',
 '',
 'Thanks,',
 'Mark']

It is important to close the file when you're done with it to release its resources back to the operating system.

In [186]:
file.close()

But note that we still have our `lines` object:

In [187]:
lines

['Hello World!',
 '',
 'My name is Mark. What is your name?',
 '',
 'Thanks,',
 'Mark']

Files can be automatically closed with a `with` statement:

In [189]:
with open(path) as file:
    lines = [x.rstrip() for x in file]

For readable files, the most common methods are:

* `read` - returns the file's contents as a string, or at most a given number of characters when a size argument is passed
* `seek` - changes the file position to the indicated byte in the file
* `tell` - returns current file position as an integer
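
The three methods can be tried on a small throwaway file. This sketch writes its own sample file to a temporary directory, so the path and contents are assumptions made for illustration rather than the `hello_world.txt` file used above.

```python
import os
import tempfile

# Write a small sample file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as handle:
    handle.write('Hello World!\nMy name is Mark.\n')

with open(path, encoding='utf-8') as handle:
    first_five = handle.read(5)  # read 5 characters
    position = handle.tell()     # current position in the file
    handle.seek(0)               # jump back to the start
    everything = handle.read()   # no argument: read the rest of the file

print(first_five)  # Hello
print(position)    # 5 for this ASCII-only file
print(everything == 'Hello World!\nMy name is Mark.\n')  # True
```

Opening the file with mode `'w'` creates it (or overwrites an existing file), and the `with` blocks close both handles automatically.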

## 3.4) Conclusion

Now with a decent understanding of the basics, we can move on to learn about NumPy and array-oriented computing.