## GA Data Science 19 (DAT19) - Class 3 - Python  

### iPython Notebook Magic

#### Shortcut keys:

comment out a line of code = command + /

run block of code = shift + enter

autocomplete text = tab

help for function = shift + tab

Web-based viewer for iPython notebooks: http://nbviewer.ipython.org/
###### Note: GitHub now renders iPython notebooks natively, which makes it easy to look at labs from any computer

In [None]:
print 'Hello World'

### Basic python data types:



`None` is the most basic data type, analogous to NULL in other languages.

In [None]:
print type(None)

There are four numeric types: ***int, float, bool***, and ***complex***.

In [None]:
type(5)

In [None]:
type('five')  # a string (str), not a numeric type

In [None]:
type(2.5)

In [None]:
type(True)

In [None]:
type(2+3j)

In [None]:
type(False)
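
As a small sketch of how these numeric types relate to each other: `bool` is a subclass of `int`, so `True` and `False` behave like 1 and 0 in arithmetic, and mixing an int with a float gives a float. The variable names here are illustrative only, and `print(...)` is written with parentheses so the cell runs under Python 2 or 3.

```python
# bool is a subclass of int, so booleans can take part in arithmetic
bool_is_int = isinstance(True, int)
total = True + True          # each True counts as 1
mixed = 5 + 2.5              # int + float gives a float
doubled = (2 + 3j) * 2       # complex numbers support the usual operators

print(bool_is_int)           # True
print(total)                 # 2
print(type(mixed))
print(doubled)               # (4+6j)
```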

Basic math with basic types - pretty obvious, except watch for integer division!

In [None]:
5 + 5

In [None]:
3 * 7

In [None]:
x = 10
y = 5
z = 4

In [None]:
x - y

In [None]:
x / y

In [None]:
x / z  # gives just the integer!

In [None]:
x / 4.0  # converts to float division if one operand is a float

In [None]:
x / float(z)  # equivalent - use float to 'cast' z to a float

In [None]:
10 % 4  # modulo operator (i.e. remainder)

A better way to handle integer division is to import `division` from the `__future__` module:

In [8]:
from __future__ import division

In [None]:
print 10/4 #works as you would expect
print 10//4 #integer division can still be specified

The internal difference is that without that import, / is mapped to the `__div__()` method, while with it, `__truediv__()` is used. (In any case, // calls `__floordiv__()`.)
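
One way to see these hooks in action, whether or not the `__future__` import is active, is through the standard `operator` module, whose `truediv` and `floordiv` functions call `__truediv__()` and `__floordiv__()` directly. The `Probe` class below is a hypothetical helper written only for this sketch.

```python
import operator

class Probe(object):
    """Hypothetical helper that reports which division hook was called."""
    def __truediv__(self, other):
        return 'truediv'

    def __floordiv__(self, other):
        return 'floordiv'

true_result = operator.truediv(10, 4)     # true division, like 10 / 4 after the import
floor_result = operator.floordiv(10, 4)   # floor division, like 10 // 4

probe = Probe()
hooks = (operator.truediv(probe, 1), operator.floordiv(probe, 1))

print(true_result)    # 2.5
print(floor_result)   # 2
print(hooks)          # ('truediv', 'floordiv')
```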

## Non-Numeric Types / Python Data Structures
The python documentation describes many different types, but we'll cover the most important ones here and introduce more complex types throughout the course. The types you'll interact with most often are:
- strings (which interact with lists in interesting ways, as we'll see in a sec)
- lists
- tuples
- sets
- dictionaries

#### Python lists  

The next basic data type is the Python ***list***.
* A list is an ***ordered collection of elements***, and these elements can be of arbitrary type.
* Lists are ***mutable***, meaning they can be changed in-place.
* In python lists are indicated by square brackets **`[`** and **`]`**.
* Items in a list are indexed starting with 0.
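
To make "changed in-place" concrete, here is a minimal sketch (the names `letters` and `alias` are made up for illustration). Assigning a list to a second name does not copy it, so a change made through either name is visible through both.

```python
letters = ['a', 'b', 'c']
alias = letters              # a second name for the SAME list, not a copy

alias[0] = 'z'               # change the list in place through the alias
letters.append('d')          # change it in place through the original name

print(letters)               # ['z', 'b', 'c', 'd']
print(alias is letters)      # True
```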

In [None]:
new_list = []
print type(new_list)

In [None]:
new_list_2 = list()
print type(new_list_2)

In [None]:
k = [1, 'b', True]

In [None]:
k[2]

In [None]:
k[1] = 'a'

In [None]:
k

You can even have lists of lists:

In [None]:
a = [[1,2,3], 4, 5]

In [None]:
a[0]

In [None]:
a[1]

Working with lists:

In [66]:
nums = [5, 5.0, 'five']     # multiple data types

In [None]:
print nums

In [None]:
type(nums)

In [None]:
len(nums)

How do we get the first element of the list and then replace it?

In [71]:
nums.append(7)   # list 'method' that modifies the list

In [None]:
help(nums.append)

In [None]:
help(nums)

In [None]:
nums.remove('five')
print nums

In [79]:
nums[0] = 6    # replace a list element

We can sort the list:

In [None]:
sorted(nums)  # 'function' that does not modify the list

In [None]:
print nums #not affected

In [None]:
nums = sorted(nums) #overwrites our original list
print nums

In [None]:
sorted(nums, reverse=True)  # optional argument reverses sort order
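
For comparison, lists also have a `sort` method. The sketch below (using a throwaway list called `values`) shows the difference: `sorted()` returns a new list and leaves the original alone, while `.sort()` reorders the list in place and returns `None`.

```python
values = [6, 5.0, 7]

ordered = sorted(values)     # new sorted list, values is untouched
snapshot = list(values)      # copy of values at this point, for comparison

result = values.sort()       # sorts values in place and returns None

print(ordered)               # [5.0, 6, 7]
print(snapshot)              # [6, 5.0, 7]
print(values)                # [5.0, 6, 7]
print(result)                # None
```

A common slip is writing `values = values.sort()`, which replaces the list with `None`.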

#### List slicing

In [92]:
weekdays = ['mon','tues','wed','thurs','fri']

In [None]:
weekdays[0]         # element 0

In [None]:
weekdays[0:3]       # elements 0, 1, 2

In [None]:
weekdays[:3]        # elements 0, 1, 2

In [None]:
weekdays[3:]        # elements 3, 4

In [None]:
weekdays[-1]        # last element (element 4)

In [None]:
weekdays[::2]       # every 2nd element (0, 2, 4)

In [None]:
weekdays[::-1]      # backwards (4, 3, 2, 1, 0)

In [None]:
days = weekdays + ['sat','sun']     # concatenate lists
print days
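
Two more points about slicing, sketched below under the assumption that `weekdays` is defined as above: a slice is a new list, so changing it leaves the original alone, and strings can be sliced with exactly the same syntax. The string `word` is just an example value.

```python
weekdays = ['mon', 'tues', 'wed', 'thurs', 'fri']

first_two = weekdays[:2]     # a new list, independent of weekdays
first_two[0] = 'MON'         # does not affect the original list

word = 'thursday'
short = word[:5]             # strings slice the same way as lists
backwards = word[::-1]       # reversed copy of the string

print(weekdays[0])           # mon
print(first_two)             # ['MON', 'tues']
print(short)                 # thurs
print(backwards)             # yadsruht
```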

#### Python Strings
Strings are written with either single or double quotes and have a variety of methods to operate on them. We will not work with much text data early on, but as we go deeper into the class, creating features from text data may be useful for your projects. A few useful methods:

In [None]:
a = 'Hello world!'
print a
print a.lower()
print a.upper()

In [None]:
name = 'my name'
print a + ' My name is ' + name

In [None]:
print a*3

In [None]:
groceries = "milk,eggs,bacon,spinach,olive oil,cookies"
groceries.split(',')

In [None]:
groceries.split(',')[-1]

In [37]:
pretty_groceries = '\n- '.join(groceries.split(','))
#print pretty_groceries
#print '- '+ pretty_groceries
#pretty_groceries
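
A short sketch of how `split` and `join` fit together: `join` only puts the separator between items, which is why the first item needs its own leading `'- '` to get a uniform bulleted list. The names `items` and `round_trip` are illustrative.

```python
groceries = "milk,eggs,bacon,spinach,olive oil,cookies"

items = groceries.split(',')                    # string -> list of strings
pretty_groceries = '- ' + '\n- '.join(items)    # list -> one string, one item per line
round_trip = ','.join(items)                    # join with ',' undoes the split

print(pretty_groceries)
print(round_trip == groceries)                  # True
```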

#### Python tuples
Tuples are similar to lists, except that they are ***immutable***, meaning that they cannot be changed in place.  Tuples are indicated by parentheses **`(`** and **`)`**.

In [38]:
x = (1, 'a', 2.5)

In [None]:
type(x)

In [None]:
x

In [None]:
(1,)  # a one-element tuple needs the trailing comma

In [None]:
x[0]

In [None]:
x[0] = 'b'  # raises a TypeError because tuples are immutable
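
The failed assignment can be caught and inspected, as in this sketch (the name `point` is made up). To "change" a tuple, build a new one from pieces of the old one.

```python
point = (1, 'a', 2.5)

try:
    point[0] = 'b'                  # not allowed: tuples are immutable
    error_message = None
except TypeError as err:
    error_message = str(err)

updated = ('b',) + point[1:]        # build a new tuple instead

print(error_message)
print(point)                        # (1, 'a', 2.5)
print(updated)                      # ('b', 'a', 2.5)
```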

#### Python sets

A python set is an unordered, mutable collection of distinct elements - roughly, a list with duplicates removed and no guaranteed order.

In [None]:
y = set([1, 1, 2, 3, 5, 8])

In [None]:
y

A set can be used as a quick way to remove duplicate elements from a list (the original order is not guaranteed to be preserved):

In [None]:
a = [1, 3, 4, 5, 5, 5, 7, 8, 8, 9]

In [None]:
list(set(a))
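
Since sets are unordered, a sketch like the following sorts the result when order matters. It also shows membership tests and the basic set operations; the second set `evens` is an invented example.

```python
a = [1, 3, 4, 5, 5, 5, 7, 8, 8, 9]

unique_sorted = sorted(set(a))      # de-duplicate, then restore a known order
evens = set([4, 8, 10])

common = set(a) & evens             # intersection: elements in both
either = set(a) | evens             # union: elements in at least one
has_seven = 7 in set(a)             # fast membership test

print(unique_sorted)                # [1, 3, 4, 5, 7, 8, 9]
print(sorted(common))               # [4, 8]
print(sorted(either))               # [1, 3, 4, 5, 7, 8, 9, 10]
print(has_seven)                    # True
```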

#### Python dictionaries (i.e. associative arrays)

Dictionaries are ***unordered*** collections of ***key-value pairs***.  Dictionaries can use various types for keys (e.g. strings, ints), so long as the key is immutable.  Values are looked up by key.  In python, dictionaries are indicated by ***curly braces { }***.

In [None]:
info = {'name': 'Bob', 'age': 54, 'kids': ['Henry', 'Phil']}

In [None]:
info['name']

In [None]:
info['age'] = 55

In [None]:
info['age']

In [None]:
num_kids = len(info['kids'])

In [None]:
num_kids
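
A few more common dictionary operations, sketched with the same `info` dictionary. The `'email'` and `'city'` keys and the city value are made up for illustration.

```python
info = {'name': 'Bob', 'age': 54, 'kids': ['Henry', 'Phil']}

has_age = 'age' in info                   # test whether a key exists
email = info.get('email', 'unknown')      # default value instead of a KeyError
info['city'] = 'Springfield'              # assigning to a new key adds it (made-up value)

keys = sorted(info.keys())                # sort for a predictable order
for key in keys:
    print(key + ': ' + str(info[key]))
```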

Some additional references:

http://python.org/documentation/  
http://www.greenteapress.com/thinkpython/thinkpython.pdf  
http://nbviewer.ipython.org/url/www.astro.washington.edu/users/vanderplas/Astr599/notebooks/01_basic_training.ipynb  
http://nbviewer.ipython.org/url/www.astro.washington.edu/users/vanderplas/Astr599/notebooks/02_advanced_data_structures.ipynb

## Flow Control

Lists, tuples, sets, and dictionaries can all be iterated through (and this will be very common). There are good and bad ways to iterate through lists: looping directly over the elements is usually cleaner than looping over index positions.

In [None]:
# for loop to print 1 through 5
nums = range(1, 6)      # create a list of 1 through 5
for num in nums:        # num 'becomes' each list element for one loop
    print num

In [None]:
# for loop to print 1, 3, 5
other = [1, 3, 5]       # create a different list
for x in other:         # name 'x' does not matter
    print x             # this loop only executes 3 times (not 5)

In [None]:
# for loop to create a list of cubes of 1 through 5
cubes = []                  # create empty list to store results
for num in nums:            # loop through nums (will execute 5 times)
    cubes.append(num**3)    # append the cube of the current value of num
    
print cubes

In [None]:
# equivalent list comprehension
cubes = [num**3 for num in nums]
print cubes
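
As a sketch of the difference between looping styles, the example below builds the same list of cubes with an index-based loop and with a comprehension, then shows a comprehension with a filter and `enumerate` for the cases where the position is actually needed. The variable names are illustrative.

```python
nums = list(range(1, 6))                # [1, 2, 3, 4, 5]

# index-based loop: works, but the indexing adds noise
cubes_by_index = []
for i in range(len(nums)):
    cubes_by_index.append(nums[i] ** 3)

# looping over the elements directly, as a comprehension
cubes = [num ** 3 for num in nums]

# a comprehension can also filter with a trailing 'if'
even_cubes = [num ** 3 for num in nums if num % 2 == 0]

# enumerate yields (position, element) pairs when the index is needed
positions = [(i, num) for i, num in enumerate(nums)]

print(cubes)                            # [1, 8, 27, 64, 125]
print(even_cubes)                       # [8, 64]
print(positions[0])                     # (0, 1)
```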

### Python File Object

In [None]:
fh = open('../data/chipotle.tsv', 'r')

for line in fh:
    if 'Veggie Salad Bowl' in line:
        print line

fh.close()  # a file opened this way must be closed explicitly

In [None]:
with open('../data/chipotle.tsv', 'r') as fh:
    for line in fh:
        if 'Veggie Salad Bowl' in line:
            print line
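
The advantage of the `with` form is that the file is closed automatically when the block ends. Here is a self-contained sketch that does not depend on the class data file: it writes a tiny tab-separated file to a temporary directory (the file name and rows are invented), reads it back, and checks `fh.closed`.

```python
import os
import tempfile

# write a small made-up file so the example runs anywhere
path = os.path.join(tempfile.mkdtemp(), 'orders.tsv')
with open(path, 'w') as out:
    out.write('1\tVeggie Salad Bowl\n')
    out.write('2\tChicken Bowl\n')
    out.write('3\tVeggie Salad Bowl\n')

matches = []
with open(path, 'r') as fh:
    for line in fh:
        if 'Veggie Salad Bowl' in line:
            matches.append(line.rstrip('\n'))   # drop the trailing newline

closed_after_with = fh.closed                   # True: 'with' closed the file

print(matches)
print(closed_after_with)
```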

### Python packages (brief intro)

Python has an extensive collection of third-party libraries, or ***packages***, with additional functions, data structures, etc.  Many (most?) packages of interest are hosted on the Python Package Index ***PyPI***, and can be installed into your environment using ***pip***.  The Anaconda distribution in the pre-work includes a number of these that are useful in data science, so you should have most of them installed already.  

ref:  https://pypi.python.org/pypi

To start, let's import a python package called numpy.

In [None]:
import numpy as np

A note on namespaces when importing - there are a few different ways to import:

    1. import numpy

    - all of numpy's functions and submodules are accessible in the numpy namespace
    - e.g. numpy.array([1,2,3])


    2. import numpy as np

    - same as 1 except an alias 'np' is created for the namespace instead
    - e.g. np.array([1,2,3])


    3. from numpy import *

    - adds all of numpy's public names to the global namespace
    - e.g. array([1,2,3])
    - Note: This can be dangerous because if different modules define the same name, then whatever is imported last will overwrite what came before it - i.e. naming collision -> overwriting!
    
    4. from numpy import array
    
    - will import only the indicated names into the global namespace
    - e.g. array([1,2,3])
    - Note: can be ok since you are being explicit
    
    We will generally use 2 and 4 (sparingly).
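
The naming collision can be sketched with two standard library modules, `math` and `cmath`, which both define `sqrt`. This is only an illustration of the overwriting behavior, not something specific to numpy.

```python
import math
import cmath

from math import sqrt
real_root = sqrt(4)             # math's sqrt returns a float: 2.0

from cmath import sqrt          # same name: silently replaces math's sqrt
complex_root = sqrt(4)          # cmath's sqrt returns a complex number: (2+0j)

# keeping the module prefix avoids the ambiguity
safe_real = math.sqrt(4)
safe_complex = cmath.sqrt(-4)

print(real_root)                # 2.0
print(complex_root)             # (2+0j)
print(safe_complex)             # 2j
```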

What does this package contain?  Introspect with `dir` to find out which names a module defines:

In [None]:
dir(np)

For detailed info about a particular name, use the ***help*** command:

In [None]:
help(np.array)

In [None]:
data = np.array([[1, 2, 3],[2, 4, 9]])
data

In [None]:
data[0]  # first row

In [None]:
data[ : , 1]  # all rows, second column

Because of the way numpy arrays are stored in memory, all elements have to be of the same type.  If there is just one float, all the values will be converted to floats.

In [None]:
data = np.array([[1, 2, 3],[2, 4.5, 9]])
data
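
The conversion can be checked through the array's `dtype` attribute. A minimal sketch, assuming numpy is installed (the array names are illustrative):

```python
import numpy as np

ints = np.array([[1, 2, 3], [2, 4, 9]])        # all integers
mixed = np.array([[1, 2, 3], [2, 4.5, 9]])     # one float among integers

print(ints.dtype)     # an integer dtype (the exact width depends on the platform)
print(mixed.dtype)    # float64
print(mixed[0])       # the first row now holds floats
```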

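Notice that the array above is still an ordinary array, with every value converted to a float.  When the columns genuinely need different types, numpy offers a different format, something called a numpy `record array` (structured array), where each row is stored as a tuple of named fields.  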

ref: http://docs.scipy.org/doc/numpy/user/basics.rec.html#structured-arrays
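
As a rough sketch of what a structured (record) array looks like, the following builds one by hand. The field names and values are invented for illustration and are not taken from the class data.

```python
import numpy as np

# each row is a tuple, and each field has its own name and type
orders = np.array(
    [(1, 2.39, 2), (2, 3.39, 1)],
    dtype=[('order_id', 'i4'), ('price', 'f8'), ('quantity', 'i4')]
)

prices = orders['price']      # select a whole column by field name
first = orders[0]             # select a row, which behaves like a tuple

print(orders.dtype.names)     # ('order_id', 'price', 'quantity')
print(prices)
print(first['order_id'])      # 1
```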

In [None]:
# data[0]  # first row of data array

In [None]:
# data[0][3]  # first row, fourth item in tuple

Plus there's more!
    
- list and array slicing
- python list comprehensions
- numpy array manipulation