# Chapter 3: Built-in Data Structures, Functions, and Files

pandas and NumPy add advanced functionality for larger datasets, but they are designed to be used together with Python's built-in data manipulation tools.

## Data Structures and Sequences

### Tuple
A **tuple** is a fixed-length, immutable sequence of Python objects.

In [1]:
#simple tuple created from a sequence of values
tup = 4, 5, 6, "Why am I here?"
print(tup)

(4, 5, 6, 'Why am I here?')


In [2]:
#with more complicated expressions, it's good to enclose the values in parentheses,
# such as when creating a tuple of tuples
nestedTup = (4, 5, 6), (7, 8)
print(nestedTup)

((4, 5, 6), (7, 8))


In [3]:
#you can convert any sequence or iterator to a tuple by invoking tuple
helloTup = tuple("hello world!")
print(helloTup)

('h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!')


In [4]:
#elements are accessed with square brackets (indexing starts at 0)
helloTup[7]

'o'

In [5]:
#tuples are immutable, but mutable objects stored inside them can still be modified in place
myTup = ('foo', [1, 2, 3], True)
myTup[1].append(4)
print(myTup)

('foo', [1, 2, 3, 4], True)
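
The flip side of the cell above is that the tuple's own slots cannot be reassigned. A minimal sketch of that behavior, using the hypothetical names `my_tup` and `letters` (neither appears in the notebook):

```python
my_tup = ('foo', [1, 2, 3], True)

# Rebinding a slot is not allowed, because the tuple itself is immutable.
try:
    my_tup[2] = False
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# Tuples have very few methods; count reports how often a value occurs.
letters = ('a', 'b', 'a', 'c', 'a')
print(letters.count('a'))
```

The list stored in slot 1 can still be changed with `append`, but the slot can never be pointed at a different object.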


In [6]:
#tuples can be concatenated with +, which produces a new, longer tuple
helloTup + myTup

('h',
 'e',
 'l',
 'l',
 'o',
 ' ',
 'w',
 'o',
 'r',
 'l',
 'd',
 '!',
 'foo',
 [1, 2, 3, 4],
 True)

In [7]:
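#multiplying a tuple by an integer concatenates that many copies of it
# (only the references are copied, not the objects themselves)
myTup * 4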

('foo',
 [1, 2, 3, 4],
 True,
 'foo',
 [1, 2, 3, 4],
 True,
 'foo',
 [1, 2, 3, 4],
 True,
 'foo',
 [1, 2, 3, 4],
 True)

#### Unpacking Tuples

In [8]:
#if you try to assign to a tuple-like expression of variables, Python will attempt to unpack the value on the
# right-hand side of the equals sign
tup = (4, 5, 6)
a, b, c = tup
print(a, b, c)

4 5 6


In [9]:
#this can be used to easily swap values
print("before swap: ", a, b)
a, b = b, a
print("after swap: ", a, b)

before swap:  4 5
after swap:  5 4


In [10]:
#a starred name such as *rest captures any remaining values as a list, which is handy for unneeded values
#NOTE the name rest is arbitrary (it is not a keyword); many programmers use *_ for values they want to discard
values = 1, 2, 3, 4, 5
a, b, *rest = values
print("a and b: ", a, b)
print("rest: ", rest)

a and b:  1 2
rest:  [3, 4, 5]
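
Unpacking also works on nested tuples and inside `for` loops. The sketch below is illustrative only; `nested`, `rows`, and `totals` are hypothetical names chosen for the example.

```python
# Nested tuples can be unpacked by mirroring their shape on the left-hand side.
nested = 4, 5, (6, 7)
a, b, (c, d) = nested
print(a, b, c, d)

# Unpacking is common when looping over a sequence of tuples.
rows = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
totals = []
for x, y, z in rows:
    totals.append(x + y + z)
print(totals)

# *_ discards everything after the first two values.
first, second, *_ = (10, 20, 30, 40)
print(first, second)
```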


### Lists
Lists are variable-length, mutable sequences. You can define them using square brackets or by using the *list* type function.

In [11]:
a_list = [2, 3, 7, None]
print("a_list: ", a_list)

tup = ('foo', 'bar', 'bazz')
b_list = list(tup)
print("b_list: ", b_list)
b_list[1] = "peekaboo"
print("b_list (modified): ", b_list)

a_list:  [2, 3, 7, None]
b_list:  ['foo', 'bar', 'bazz']
b_list (modified):  ['foo', 'peekaboo', 'bazz']


The *list* function is often used in data processing as a way to **materialize an iterator or generator expression**.

In [12]:
#range produces a lazy range object (not a list); its values are only materialized on request
gen = range(10)
print("gen: ", gen)

print(list(gen))

gen:  range(0, 10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [13]:
#append adds an element to the end of the list
b_list.append('dwarf')
print(b_list)

['foo', 'peekaboo', 'bazz', 'dwarf']


In [14]:
#inserting into a list
b_list.insert(1, 'red')
print(b_list)

['foo', 'red', 'peekaboo', 'bazz', 'dwarf']


In [15]:
#pop removes and returns the element at a particular index
b_list.pop(2)

'peekaboo'

In [16]:
#with no argument, pop removes the last element (like a stack data structure)
b_list.pop()

'dwarf'

In [17]:
#remove an element specified by value
b_list.remove('foo')
print(b_list)

['red', 'bazz']


#### Concatenating and Combining Lists

In [18]:
#you can add to lists with the addition operator
print([4, None, 'foo'] + [7, 8, (2, 3)])

#if you have a list already defined then you can call the extend method
#NOTE: extend is generally more efficient than +, since it modifies the existing list instead of creating and copying into a new one
b_list.extend([3, 4, 5, 6, 'seven'])
b_list.extend((4, 5, 6))
print(b_list)

[4, None, 'foo', 7, 8, (2, 3)]
['red', 'bazz', 3, 4, 5, 6, 'seven', 4, 5, 6]
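
The difference between `+` and `extend` can be made visible by checking object identity. This is a small sketch with hypothetical names (`base`, `alias`, `chunks`, `everything`), not a benchmark.

```python
base = ['red', 'bazz']
alias = base  # a second name for the same list object

# + builds a brand-new list and leaves the operands untouched.
combined = base + [3, 4]
print(combined is base)

# extend mutates the existing list in place, so the alias sees the change.
base.extend([3, 4])
print(alias)

# Building one long list from many chunks: extend avoids repeated copying.
chunks = [[1, 2], [3], [4, 5, 6]]
everything = []
for chunk in chunks:
    everything.extend(chunk)
print(everything)
```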


#### Sorting

In [19]:
#you can sort a list in place by calling its sort method
sortedList = [7, 3, 9, 2, 81, 1, 1, 1, 9, 2, 3, 4, 5]
print(sortedList)
sortedList.sort()
print(sortedList)

[7, 3, 9, 2, 81, 1, 1, 1, 9, 2, 3, 4, 5]
[1, 1, 1, 2, 2, 3, 3, 4, 5, 7, 9, 9, 81]


In [20]:
#you can also pass a sort key, a function that produces the value used to order the objects
b = ['saw', 'small', 'he', 'foxes', 'six']
b.sort()
print("default string sort (alphabetical):", b)

#sort is stable, so strings of equal length keep the alphabetical order from the first sort
b.sort(key=len)
print(b)

default string sort (alphabetical): ['foxes', 'he', 'saw', 'six', 'small']
['he', 'saw', 'six', 'foxes', 'small']


#### Binary Search and Maintaining a Sorted List
The built-in *bisect* module implements binary search and insertion into a sorted list.

In [21]:
import bisect
c = [1, 2, 2, 2, 3, 4, 7]

#bisect.bisect finds the location where an element should be inserted to keep the list sorted
bisect.bisect(c, 2)

4

In [22]:
#bisect.insort actually inserts the element into that location
bisect.insort(c, 6)
print(c)

[1, 2, 2, 2, 3, 4, 6, 7]
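
When the list contains duplicates, it matters which side of the equal values the insertion point falls on. The sketch below compares `bisect_left` and `bisect_right`; the variable names `left` and `right` are only illustrative.

```python
import bisect

c = [1, 2, 2, 2, 3, 4, 7]

# bisect.bisect is an alias for bisect_right: the insertion point falls
# after any existing entries equal to the value.
right = bisect.bisect_right(c, 2)

# bisect_left returns the position before any equal entries.
left = bisect.bisect_left(c, 2)
print(left, right)

# In a sorted list, the gap between the two is the number of occurrences.
print(right - left)

# Caution: these functions do not check that the list is sorted.
# Calling them on an unsorted list raises no error but gives unreliable results.
```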


## Slicing (IMPORTANT)
You can select sections of most sequence types by using slice notation, which in its basic form consists of

**start:stop:step**

NOTE: Python slices are **exclusive** of the stop index, whereas pandas label-based slicing (for example with *loc*) is **inclusive** of the stop label.

In [23]:
seq = [7, 2, 3, 7, 5, 6, 0, 1]
#note how the element at the stop index (5) is not included
print(seq[1:5])

[2, 3, 7, 5]


Either the **start** or the **stop** can be omitted, in which case they default to the start and end of the sequence, respectively.

In [24]:
print(seq[3:])
print(seq[:3])

[7, 5, 6, 0, 1]
[7, 2, 3]


Negative indices slice the sequence relative to the end

In [25]:
print(seq[-4:])
print(seq[-6:-2])

[5, 6, 0, 1]
[3, 7, 5, 6]


You can also provide a step (default 1) after a second colon, which can be used to skip elements or even to go backwards.

In [26]:
#take every other element
print(seq[::2])
#a step of -1 reverses the list
print(seq[::-1])

[7, 3, 5, 0]
[1, 0, 6, 5, 7, 3, 2, 7]
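
Slices can also appear on the left-hand side of an assignment. The following sketch, which reuses the same starting values and introduces the hypothetical name `head`, shows that a slice is a copy while slice assignment changes the original list.

```python
seq = [7, 2, 3, 7, 5, 6, 0, 1]

# A slice is a new list, so changing it leaves seq alone.
head = seq[:3]
head[0] = 99
print(seq[0])

# Assigning to a slice replaces that stretch; the lengths may differ.
seq[3:5] = [6, 3, 8]
print(seq)

# With in-range indices, a slice holds stop - start elements.
print(len(seq[1:5]))
```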


## Built-in Sequence Functions

### Enumerate
*enumerate* lets you keep track of the index of the current item while iterating over a sequence.

In [27]:
aList = []

#do-it-yourself approach
i = 0
for value in list("Collection"):
    aList.append((value, i))
    i = i + 1

print(aList)

[('C', 0), ('o', 1), ('l', 2), ('l', 3), ('e', 4), ('c', 5), ('t', 6), ('i', 7), ('o', 8), ('n', 9)]


In [28]:
aList.clear()

#python has a built-in function, enumerate, which yields (i, value) tuples, counting from 0
for i, value in enumerate(list("Collection")):
    aList.append((value, i))
    
print(aList)

[('C', 0), ('o', 1), ('l', 2), ('l', 3), ('e', 4), ('c', 5), ('t', 6), ('i', 7), ('o', 8), ('n', 9)]
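
By default `enumerate` counts from 0. If 1-based positions are wanted, it accepts a `start` argument. A minimal sketch, with `word` and `one_based` as hypothetical names:

```python
word = "Collection"

# start=1 shifts the counter without changing which items are visited.
one_based = [(letter, position) for position, letter in enumerate(word, start=1)]
print(one_based[:3])
```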


In [29]:
#when indexing data, a helpful pattern that uses enumerate is building a dict that maps each position in a sequence to its value

some_list = ['foo', 'bar', 'baz']
mapping = {i: val for i, val in enumerate(some_list)}
print(mapping)

{0: 'foo', 1: 'bar', 2: 'baz'}
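
The mapping can just as well run the other way, from value to position, which is useful for looking up where an item sits. This sketch assumes the values are unique; `positions` is a hypothetical name.

```python
some_list = ['foo', 'bar', 'baz']

# Swap the pair in the comprehension to map each (unique) value to its index.
positions = {val: i for i, val in enumerate(some_list)}
print(positions)
print(positions['bar'])
```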


In [30]:
#sorted returns a new sorted list (doesn't do it in place)
# it accepts the same arguments as sort
print(sorted(some_list))

['bar', 'baz', 'foo']


In [31]:
#zip pairs up the elements of a number of lists, tuples, or other sequences, producing tuples
seq1 = ['foo', 'bar', 'baz']
seq2 = ['one', 'two', 'three']

zipped = zip(seq1, seq2)
#zip returns a lazy iterator (a zip object), so wrap it in list to see the pairs
print(zipped)

print(list(zipped))

<zip object at 0x00000264EFB8A980>
[('foo', 'one'), ('bar', 'two'), ('baz', 'three')]


In [32]:
#zip can take in an arbitrary number of sequences
# the number of elements it produces is determined by the shortest sequence
seq3 = [False, True]
print(list(zip(seq1, seq2, seq3)))

[('foo', 'one', False), ('bar', 'two', True)]


In [33]:
#a common use of zip is to simultaneously iterate over multiple sequences, possibly combined with enumerate
for i, (a, b) in enumerate(zip(seq1, seq2)):
    print('{0}: {1}, {2}'.format(i, a, b))

0: foo, one
1: bar, two
2: baz, three
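
`zip` can also be run in reverse to split a list of pairs back into separate sequences. The sketch below uses illustrative data and hypothetical names (`pairs`, `left`, `right`, `first_pass`, `second_pass`); it also shows that a zip object can only be consumed once.

```python
pairs = [('foo', 'one'), ('bar', 'two'), ('baz', 'three')]

# zip(*pairs) passes each pair as a separate argument, "unzipping" the list.
left, right = zip(*pairs)
print(left)
print(right)

# A zip object is a one-shot iterator: a second pass finds nothing left.
zipped = zip(['a', 'b'], [1, 2])
first_pass = list(zipped)
second_pass = list(zipped)
print(first_pass, second_pass)
```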


## Functions as Objects

In [34]:
#because functions are objects, many constructs can be expressed easily in python

#data cleaning
states = ['   Alabama ', 'Georgia!', 'Georgia', 'georgia', 'Fl0rIda', 'south   carolina###', 'West Virginia?']

#regular expressions
import re
def clean_strings(strings):
    result = []
    for value in strings:
        #remove space at beginning and end of string
        value = value.strip()
        #replace !, #, and ? characters with nothing
        value = re.sub('[!#?]', '', value)
        #make first character of every word capital
        value = value.title()
        result.append(value)
        
    return result


print(clean_strings(states))

['Alabama', 'Georgia', 'Georgia', 'Georgia', 'Fl0Rida', 'South   Carolina', 'West Virginia']


In [35]:
#an alternative is to store the cleaning functions in a list and apply them in order
def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

cleanOps = [str.strip, remove_punctuation, str.title]

def cleanStrGen(strings, ops):
    result = []
    for value in strings:
        for function in ops:
            value = function(value)
        result.append(value)
    return result

print(cleanStrGen(states, cleanOps))
            

['Alabama', 'Georgia', 'Georgia', 'Georgia', 'Fl0Rida', 'South   Carolina', 'West Virginia']
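
Keeping the operations in a list makes the cleaning pipeline easy to extend, and functions can also be handed to built-ins such as `map`. The sketch below is one possible variation: `collapse_spaces` is a hypothetical extra step (not part of the notebook) that squeezes repeated whitespace, and the `states` list is shortened for the example.

```python
import re

def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

# Hypothetical extra step: collapse runs of whitespace into a single space.
def collapse_spaces(value):
    return re.sub(r'\s+', ' ', value)

states = ['   Alabama ', 'Georgia!', 'south   carolina###']

# map applies one function to every element; list materializes the result.
no_punct = list(map(remove_punctuation, states))
print(no_punct)

# Adding a step only requires adding a function to the list.
clean_ops = [str.strip, remove_punctuation, collapse_spaces, str.title]
cleaned = []
for value in states:
    for op in clean_ops:
        value = op(value)
    cleaned.append(value)
print(cleaned)
```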


## Lambda functions
*Lambda* or *anonymous* functions are a way of writing small functions that consist of a single expression, whose result is the return value, without a separate *def* statement. They are defined with the *lambda* keyword and are most useful as arguments to other functions.

In [36]:
#a short function and its equivalent lambda
def short_function(x):
    return x * 2

equiv_anon = lambda x: x * 2
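
Lambdas are most convenient when passed directly as arguments. In the sketch below, `apply_to_list` is a hypothetical helper written for this example; the second part uses a lambda as a sort key.

```python
# Hypothetical helper that applies f to every element of a list.
def apply_to_list(some_list, f):
    return [f(x) for x in some_list]

ints = [4, 0, 1, 5, 6]
doubled = apply_to_list(ints, lambda x: x * 2)
print(doubled)

# Sort strings by their number of distinct letters.
strings = ['foo', 'card', 'bar', 'aaaa', 'abab']
strings.sort(key=lambda x: len(set(x)))
print(strings)
```

Because the sort is stable, 'foo' stays ahead of 'abab' even though both have two distinct letters.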

## Currying: partial argument application
*Currying* is jargon that means deriving new functions from existing ones by *partial argument application*

In [37]:
#regular function
def add_numbers(x, y):
    return x + y

#curried, or derived function: the first argument is fixed at 5
add_five = lambda y: add_numbers(5, y)
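
The standard library's `functools.partial` offers the same kind of derivation without writing a lambda. A minimal sketch, in which `add_numbers` is defined as a plain two-argument sum and `parse_binary` is a hypothetical name:

```python
from functools import partial

def add_numbers(x, y):
    return x + y

# Fix the first positional argument at 5.
add_five = partial(add_numbers, 5)
print(add_five(10))

# Keyword arguments can be fixed as well, here the base used by int.
parse_binary = partial(int, base=2)
print(parse_binary('101'))
```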

## Generators
A consistent way to iterate over sequences, such as lists, tuples, and the lines of a file, is an important Python feature. This is accomplished by means of the *iterator protocol*, a generic way to make objects iterable.

In [38]:
some_dict = {'a': 1, 'b': 2, 'c': 3}

for key in some_dict:
    print(key)

a
b
c


In [39]:
#when you write for key in some_dict, the interpreter first attempts to create an iterator out of the dict
dictIterator = iter(some_dict)
print(dictIterator)

<dict_keyiterator object at 0x00000264EFBA6450>


In [40]:
#an iterator is any object that will yield objects to the python interpreter when used in a context like a for loop. most methods expecting
# a list or list-like object will also accept any iterable object
print(list(dictIterator))

['a', 'b', 'c']


A *generator* is a concise way to construct a new iterable object. Whereas a normal function executes and returns a single result, a generator returns a sequence of multiple results **lazily**: execution pauses after each value is yielded and resumes only when the next one is requested. To create a generator, use the *yield* keyword instead of *return* in a function.

In [41]:
def squares(n = 10):
    print('Generating squares from 1 to {0}'.format(n ** 2))
    for i in range(1, n + 1):
        yield i ** 2
        
#calling a generator function runs none of its body; it just returns a generator object
gen = squares()
print(gen)

<generator object squares at 0x00000264EFBAD580>


In [42]:
#code is only executed when you request elements from the generator. lazy evaluation
for x in gen:
    print(x)

Generating squares from 1 to 100
1
4
9
16
25
36
49
64
81
100
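
For simple cases, a generator expression gives the same lazy behavior without a `def`. It looks like a list comprehension wrapped in parentheses instead of brackets. A short sketch (the names `gen`, `total`, and `squares_by_number` are illustrative):

```python
# Nothing is computed until values are requested.
gen = (x ** 2 for x in range(100))
first = next(gen)
second = next(gen)
print(first, second)

# Generator expressions can be passed straight to functions that consume iterables.
total = sum(x ** 2 for x in range(100))
print(total)

squares_by_number = dict((i, i ** 2) for i in range(5))
print(squares_by_number)
```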


## The itertools Module
The standard library *itertools* module has a collection of generators for many common data algorithms:
* combinations(iterable, k) - generates a sequence of all possible k-tuples of elements in the iterable, ignoring order and without replacement, so (A, B) and (B, A) count as the same combination
* permutations(iterable, k) - generates a sequence of all possible k-tuples of elements in the iterable, respecting order, so (A, B) and (B, A) are distinct
* groupby(iterable, keyfunc) - generates (key, sub-iterator) for each run of consecutive items that share a key
* product(*iterables, repeat=1) - generates the Cartesian product of the input iterables as tuples

In [43]:
#ex - group by
import itertools

firstLetter = lambda x: x[0]
names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']

#groupby only groups consecutive items, which is why 'A' appears twice in the output
for letter, group in itertools.groupby(names, firstLetter):
    print(letter, list(group)) #group is an iterator over the matching names

A ['Alan', 'Adam']
W ['Wes', 'Will']
A ['Albert']
S ['Steven']
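
The other functions in the list can be tried on a small input, and sorting before `groupby` is a common way to get one group per key. The sketch below uses hypothetical names (`letters`, `combos`, `perms`, `bit_pairs`, `grouped`).

```python
import itertools

letters = ['A', 'B', 'C']

# Order is ignored: ('A', 'B') appears, ('B', 'A') does not.
combos = list(itertools.combinations(letters, 2))
print(combos)

# Order matters: both ('A', 'B') and ('B', 'A') appear.
perms = list(itertools.permutations(letters, 2))
print(len(perms))

# repeat=2 pairs the iterable with itself.
bit_pairs = list(itertools.product([0, 1], repeat=2))
print(bit_pairs)

# Sorting first puts equal keys next to each other, so each key yields one group.
names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']
grouped = {
    letter: list(group)
    for letter, group in itertools.groupby(sorted(names), key=lambda n: n[0])
}
print(grouped)
```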


## Errors and Exception Handling
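Handling Python errors or *exceptions* gracefully is an important part of building robust programs. In data analysis, many functions only work on certain kinds of input. For example, *float* converts a string to a floating-point number but fails with a ValueError on improper input.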

In [44]:
print(float('1.2345'))
#print(float('something')) would raise ValueError: could not convert string to float: 'something'

1.2345


In [45]:
#this function handles failed conversions gracefully
def attempt_float(x):
    try:
        return float(x)
    #only ValueError is caught here; other exceptions, such as TypeError, still propagate
    except ValueError:
        return x
    
print(attempt_float('3.14159'))
print(attempt_float('parade float'))

3.14159
parade float
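
Several exception types can be caught at once by listing them in a tuple, and `else` and `finally` clauses control what runs on success and what always runs. This is a sketch under the assumption that returning the input unchanged is the desired fallback; `parse_with_log` and `log` are hypothetical names.

```python
def attempt_float(x):
    try:
        return float(x)
    # float((1, 2)) raises TypeError rather than ValueError, so catch both.
    except (TypeError, ValueError):
        return x

print(attempt_float((1, 2)))

log = []

def parse_with_log(x):
    try:
        value = float(x)
    except ValueError:
        log.append('failed')
        value = None
    else:
        # Runs only if the try block raised nothing.
        log.append('succeeded')
    finally:
        # Runs whether or not an exception occurred.
        log.append('done')
    return value

print(parse_with_log('2.5'))
print(parse_with_log('abc'))
print(log)
```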


## Files and the Operating System
To open a file for reading or writing, use the built-in *open* function. By default the file is opened in read-only mode ('r'). The resulting file object can then be treated like an iterator and iterated over line by line.

In [47]:
#open a file; with no mode argument it is opened for reading
file = open('test.txt')

In [49]:
for line in file:
    print(line)

this is a file.

I like this file.

it is a good file.

Hello File.


In [50]:
#the lines coming out of the file keep their end-of-line markers (\n); use rstrip to remove them
#the with block closes the file automatically when it ends
with open('test.txt') as handle:
    lines = [x.rstrip() for x in handle]
for line in lines:
    print(line)

this is a file.
I like this file.
it is a good file.
Hello File.


In [51]:
#when you use open to create file objects, it is important to explicitly close the file. closing the file releases resources back to the operating system
file.close()
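
An easier way to make sure a file is closed is the `with` statement, which closes it automatically when the block exits. The sketch below writes its own throwaway file in a temporary directory so that it does not depend on test.txt; the path and the name `handle` are assumptions for the example.

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as handle:
    handle.write('first line\nsecond line\n')

# The file is closed as soon as the with block ends, even if an error occurs.
with open(path) as handle:
    lines = [line.rstrip() for line in handle]

print(lines)
print(handle.closed)
```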

In [55]:
#mode 'w' creates a new file, overwriting any existing file with the same name
f = open('newFile.txt', 'w')
f.write("this is a sentence.")
f.close()

In [58]:
#mode 'x' creates a new file for writing and fails if the file already exists
try:
    f = open('newFile.txt', 'x')
    f.close()
except FileExistsError:
    print("file already exists.")

file already exists.


### Read, Seek, Tell
* Read - returns a certain number of characters from the file (what counts as a "character" is determined by the file's encoding), or that many raw bytes if the file is opened in binary mode. It advances the file handle's position by the number of bytes read.
* Tell - gives the current position of the file handle
* Seek - changes the file position to the indicated byte in the file

In [63]:
f = open('test.txt')
print(f.read(10))


this is a 


In [64]:
f2 = open('test.txt', 'rb')
print(f2.read(10))


b'this is a '


In [65]:
print(f.tell())
print(f2.tell())

10
10


In [66]:
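#sys.getdefaultencoding reports the default encoding python uses for str/bytes conversion (always 'utf-8' in python 3);
# the default encoding used by open is platform dependent unless you pass encoding= explicitly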
import sys
sys.getdefaultencoding()

'utf-8'

In [67]:
f.seek(0)
f.read(20)

'this is a file.\nI li'

In [68]:
f.close()
f2.close()
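
With plain ASCII text, characters and bytes line up one to one, which is why both handles above report position 10. The difference shows once the file contains non-ASCII characters. The sketch below writes its own small UTF-8 file; the file name and contents are made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'accents.txt')
with open(path, 'w', encoding='utf-8') as handle:
    handle.write('Señor niño')

# Text mode: read(4) returns four characters, however many bytes they occupy.
with open(path, encoding='utf-8') as text_file:
    chars = text_file.read(4)

# Binary mode: read(4) returns exactly four bytes; the letter ñ takes two in UTF-8.
with open(path, 'rb') as binary_file:
    raw = binary_file.read(4)
    binary_pos = binary_file.tell()
    binary_file.seek(0)           # jump back to the first byte
    restart = binary_file.read(2)

print(chars)
print(raw, binary_pos)
print(restart)
```

In text mode the value returned by `tell` should be treated as an opaque position to hand back to `seek`, not as a character count.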