# A crash course on Python for Data Analysis

1.	Python basic programming:
    + Python modules or how to extend the basic instruction set
    + Data Types and Operations
    + A program in Python
    + A function in Python
    + Slicing
    + References
    + Comprehensions
    + Generators
    + Objects
    + Reading data
    + Python Goodies
2.	Advanced programming for data analysis: NumPy
3.  Advanced programming for data analysis: pandas
4.  Matplotlib
5.  Sklearn

# Python online resources

+ Python Practice Book: [Link](http://anandology.com/python-practice-book/index.html)
+ Python Online Course at Codecademy: [Link](http://www.codecademy.com/en/tracks/python)
+ Think Python: How to Think Like a Computer Scientist Book: [Link](http://greenteapress.com/thinkpython/index.html)

## Python

The software program that you use to invoke operators is called an **interpreter**. 

You enter your commands as a ‘dialog’ between you and the interpreter. 

Commands can be entered as part of a script (a text file with a list of commands to perform) or directly at the *cell*. 

To tell the interpreter what to do, you must **invoke** an operator:

In [None]:
3 + 4 + 9

In [None]:
range(10)

It’s helpful to think of the computation carried out by an operator as involving four parts:

+ The name of the operator
+ The input arguments
+ The output value
+ Side effects

A typical operation takes one or more input arguments and uses the information in these to produce an output value. Along the way, the computer might take some action: display a graph, store a file, make a sound, etc. These actions are called side effects.

Python is a general-purpose programming language, so when we want to use more specific commands (such as statistical operators or string processing operators) we usually need to import them before we can use them. <br>

For Scientific Python, one of the most important libraries that we need is **numpy** (Numerical Python), which can be loaded like this: ``import numpy as np``

In [None]:
import numpy as np
np.sqrt(25)

In [None]:
np.arange(10)

In [None]:
type(np.arange(10))

Access to the functions, variables and classes of a module depends on the way the module was imported:

In [None]:
import math
math.cos(math.pi)

In [None]:
import math as m           # import using an alias
m.cos(m.pi)

In [None]:
from math import cos,pi    # import only some functions
cos(pi)

In [None]:
from math import *         # global import
cos(pi)

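It is critical to define a clear policy for ``import``: a global import such as ``from math import *`` can silently overwrite names that are already defined, so importing a module under an explicit name or alias is usually safer.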
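
One reason an import policy matters is that directly imported names can shadow each other without any warning. The sketch below illustrates this with two standard-library modules, ``math`` and ``cmath``, which both define ``sqrt``; the choice of modules and the names ``real_root`` and ``complex_root`` are ours, for illustration only. It is written so that it runs under both Python 2.7 and Python 3.

```python
from math import sqrt
print(sqrt(4))            # math.sqrt returns a float: 2.0

from cmath import sqrt    # silently replaces the previous sqrt
print(sqrt(4))            # cmath.sqrt returns a complex number: (2+0j)

# Importing the modules themselves keeps both functions available.
import math
import cmath
real_root = math.sqrt(4)
complex_root = cmath.sqrt(-4)
print(real_root)          # 2.0
print(complex_root)       # 2j
```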

Often the value returned by an operation will be used later on. Values can be stored for later use with the **assignment operator**:

In [None]:
a = 101

The command has stored the value 101 under the **name** <code>a</code>. Such stored values are called **objects**. 

Making an assignment to an object **defines** the object. 

Once an object has been defined, it can be referred to and used in later computations. 

To refer to the value stored in the object, just use the object’s name itself. For instance:

In [None]:
b = np.sqrt(a)
b

There are some general rules for object names:

+ Use only letters and numbers and ‘underscores’ (_)
+ Do NOT use spaces anywhere in the name
+ A number cannot be the first character in the name
+ Capital letters are treated as distinct from lower-case letters (i.e., Python is case-sensitive)

In [None]:
3a = 10

When you assign a new value to an existing object (*dynamic typing*), the former value of that object is discarded. The former value of ``b`` was 10.0498756211, but after a new assignment:

In [None]:
b = 'a'
print b

The value of an object is changed only via the assignment operator. Using an object in a computation does not change the value. 

The brilliant thing about organizing operators in terms of input arguments and output values is that the output of one operator can be used as an input to another. This lets complicated computations be built out of simpler ones.

One way to connect the computations is by using objects to store the intermediate outputs:

In [None]:
a = np.arange(5)
np.sqrt(a)

You can also pass the output of an operator directly as an argument to another operator:

In [None]:
np.sqrt(np.arange(5))

### Data Types

Most of the examples used so far have dealt with numbers. But computers work with other kinds of information as well: text, photographs, sounds, sets of data, and so on. The word *type* is used to refer to the kind of information. 

It’s important to know about the types of data because operators expect their input arguments to be of specific types. When you use the wrong type of input, the computer might not be able to process your command.
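
As a small illustration of a type mismatch, the sketch below tries to add a string to an integer, which Python rejects with a ``TypeError``, and then converts the string first. The variable names are ours.

```python
try:
    result = "3" + 4          # a string plus an integer is not defined
except TypeError as error:
    result = None
    print("TypeError: {}".format(error))

# Converting the string first makes the types agree.
total = int("3") + 4
print(total)                  # 7
```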

For our purposes, it’s important to distinguish among several basic types:

+ Numeric (positive and negative) data: 
    + decimal and fractional numbers (**floats**), <code>a = 3.5</code>
    + whole numbers (**integers**), <code> b = -12560</code>, and 
    + arbitrary length whole numbers (**longs**):  <code>c=1809109863596239561236235625629561L</code>
+ **Strings** of textual data - you indicate string data to the computer by enclosing the text in quotation marks (e.g., <code>name = "python"</code>).
+ **Boolean** data: <code>a = True</code> or <code>a = False</code>.
+ **Complex** numbers: <code>a = 2+3j</code>
+ Collection types: **tuples, lists, sets, dictionaries** and **files**.

In [None]:
a = 10

In [None]:
a = 'a'
print a

### Operators

+ Addition (also string, tuple and list concatenation) <code>a + b</code>
+ Subtraction (also set difference): <code>a - b</code>
+ Multiplication (also string, tuple and list replication): <code>a * b</code>
+ Division: <code>a / b</code>
+ Floor division (the result is rounded towards minus infinity): <code>a // b</code>
+ Modulus or remainder: <code>a % b</code>
+ Exponentiation: <code>a ** b</code>
+ Assignment: <code>=</code>, <code>-=</code>, <code>+=</code>,<code>/=</code>,<code>*=</code>, <code>%=</code>, <code>//=</code>, <code>**=</code>
+ Boolean comparisons: <code>==</code>, <code>!=</code>, <code><</code>,<code>></code>,<code><=</code>, <code>>=</code>
+ Boolean operators: <code>and</code>, <code>or</code>, <code>not</code>
+ Membership test operators: <code>in</code>, <code>not in</code>
+ Object identity operators: <code>is</code>, <code>is not</code>
+ Bitwise operators (or, xor, and, complement): <code>|</code>, <code>^</code>, <code>&</code>, <code>~</code>
+ Left and right bit shift: <code><<</code>, <code>>></code>
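
A few of the operators listed above can be tried out as follows. This is only a sketch with values chosen by us; it should behave the same under Python 2.7 and Python 3.

```python
quotient = -7 // 2             # floor division rounds towards minus infinity
remainder = -7 % 2             # the remainder takes the sign of the divisor
power = 2 ** 10
joined = [1, 2] + [3]          # + concatenates lists
repeated = "ab" * 3            # * replicates strings
difference = {1, 2, 3} - {2}   # - is set difference
bits = 6 & 3                   # bitwise and: 0b110 & 0b011

print(quotient)            # -4
print(remainder)           # 1
print(power)               # 1024
print(joined)              # [1, 2, 3]
print(repeated)            # ababab
print(sorted(difference))  # [1, 3]
print(bits)                # 2
```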

### Python as a calculator

The Python language has a concise notation for arithmetic that looks very much like the traditional one.

In [None]:
a = 3+2
b= 3.5 * -8
c = 10/6
print a, b, c, 10./6.

Some math functions are not available in the basic Python module, and they need to be imported from a specific module:

In [None]:
import math   # this instruction is not executed if the module has already been imported
print math.pi + math.sin(100) + math.ceil(2.3)

### A program in Python

General Rules:

+ All text from a <code>#</code> symbol to the end of a line is considered a comment.
+ Code must be **indented** and sometimes delineated by colons. The Python standard for indentation is four spaces. Never use tabs: they can produce hard-to-find errors. Set your editor to convert tabs to spaces.
+ Typically, a statement must be on a line. You can use a backslash <code>\</code> at the end of a line to continue a statement on to the next line.


In [None]:
# This program computes the factorial of 100.

fact = 1L
n= 100
for factor in range(n,0,-1):
    fact = fact * factor 
print fact    

In [None]:
range(10,0,-1)

When we write a colon at the end of an iteration, all lines indented at the next level are considered *part* of the iteration. 

When we write a line at the same indentation as the iteration, we are closing the iteration.

### A function in Python



#### Factorial

The factorial of a non-negative integer $n$, denoted by $n!$, is the product of all positive integers less than or equal to $n$.  

In [None]:
def factorial(n):
    """
    Return the factorial of n. 
    """
    fact = 1L
    for factor in range(n,0,-1):
        fact = fact * factor
    return fact

In [None]:
factorial(n)

In [None]:
help(factorial)

#### Fibonacci

The Fibonacci Sequence is the series of numbers: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, ...

The general rule to compute the sequence is very simple: The next number is found by adding up the two numbers before it.

In [None]:
def fib1(n):
    """
    Return n-th member of the Fibonacci sequence. 
    It uses a recursive approach.
    """
    if n==1:
        return 1
    if n==0:
        return 0
    return fib1(n-1) + fib1(n-2)

fib1(20)

# this function cannot compute fib(100)
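
The recursive version becomes impractical for large ``n`` because every call spawns two more, so the number of calls grows exponentially. The sketch below makes this visible by counting calls; ``fib_counted`` and ``counter`` are names introduced here for illustration.

```python
counter = {"calls": 0}

def fib_counted(n):
    """Recursive Fibonacci that also counts how many times it is called."""
    counter["calls"] += 1
    if n < 2:
        return n
    return fib_counted(n - 1) + fib_counted(n - 2)

value = fib_counted(20)
print(value)              # 6765
print(counter["calls"])   # 21891
```

Roughly twenty thousand calls are needed for ``n = 20``, and the count keeps multiplying by about 1.6 for each further step.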

In [None]:
def fib2(n):
    """
    Return n-th member of the Fibonacci sequence. 
    It uses a non-recursive approach.
    """
    a, b = 0, 1
    for i in range(1,n+1):
        a, b = b, a + b
    return a

n = 1000
if n<15:
    print fib1(n)
else: 
    print fib2(n)

#### Greatest Common Divisor

The greatest common divisor of two positive integers $a$ and $b$ is the largest divisor common to $a$ and $b$.  The Euclidean algorithm, or Euclid's algorithm, is an iterative method for computing the greatest common divisor of two integers. 

+ If $a<b$, exchange $a$ and $b$.
+ Divide $a$ by $b$ and get the remainder, $r$. If $r=0$, report $b$ as the GCD of $a$ and $b$.
+ Replace $a$ by $b$ and replace $b$ by $r$. If $r \neq 0$ iterate.



In [None]:
def gcd(a,b): 
    """
    Euclid's algorithm v1.0: pseudocode translation
    """
    r = 1
    while r != 0:
        if a<b:
            c=a
            a=b
            b=c
        r = a%b 
        if r == 0:
            return b
        else:
            a = b
            b = r

gcd(100,16)

In [None]:
def gcd(a,b):  
    """
    Euclid's algorithm v2.0: idiomatic Python
    """
    while a:
        a, b = b%a, a
    return b

gcd(100,16)

In [None]:
a = 67
a == False

#### Exercise

Compute an approximation of $\pi$ using Monte Carlo. 

How? If we can estimate the area of the circle of radius $r = 1/2$ inscribed in the unit square, then dividing by $r^2 = (1/2)^2 = 1/4$ gives an estimate of $\pi$. We may estimate the area by sampling bivariate uniforms and looking at the fraction that fall into the circle.


In [None]:
from math import sqrt
import random
## Your solution here.



### String processing with Python

Strings are sequences of characters:

In [None]:
a = "python"
type(a)

In [None]:
print "Hello"

In [None]:
print "This is 'an example' of the use of quotes and double quotes"
print 'This is "another example" of the use of quotes and double quotes'

We can use the operator ``+`` to concatenate strings:

In [None]:
a = 'He'
b = 'llo'
c = a+b+'!'
print c

Substrings within a string can be accessed using **slicing**. Slicing uses ``[]`` to contain the indices of the characters in a string, where the first index is $0$, and the last is $n - 1$ (assuming the string has $n$ characters). 

In [None]:
a = 'Python'
print a[:], a[1], a[2:], a[:3], a[2:4], a[::2], a[1::2]

The most advanced string functions are stored in an external module called ``string``:

In [None]:
import string as st
help(st)

In [None]:
type(st.atof('10.3'))

In [None]:
a = 'a'

In [None]:
# press tab
a.

In [None]:
a='Hello'
b = a.lower()
print b

In [None]:
print st.ascii_letters
'a' in st.ascii_letters

The ``.format()`` method of the str type is an extremely convenient way to format text exactly the way you want it.

In [None]:
print "We have {} hectares planted to {}.".format(49, "okra")

In [None]:
print "We have {0} hectares planted to {1}.".format(49, "okra")

In [None]:
print "We have {1} {1} {1} hectares planted to {0}.".format(49, "okra")

In [None]:
print "{monster} has now eaten {city}".format(city='Tokyo', monster='Mothra')

In [None]:
print "Res: {} Km".format(1342.893663456)

By default values are formatted to take up only as many characters as needed to represent the content. It is however also possible to define that a value should be padded to a specific length.

In [None]:
print "Res: {:10.4f} Km".format(1342.893663456)
print "Res: {:10.4f} Km".format(42.8)
print "Res: {:10.4f} Km".format(134.89)

``10.4`` means a width of 10 characters and a precision of 4 decimal places.

In [None]:
l = (1342.893663456, 25, 5)
print "Res: {0:9.4f} Km".format(*l)
print "Res: {1:9.4f} Km".format(*l)
print "Res: {2:9.4f} Km".format(*l)

### Lists

Lists are a built-in data type which require other data types to be useful. A list is a collection of other objects – floats, integers, complex numbers, strings or even other lists.

Lists also support slicing to retrieve one or more elements. Basic lists are constructed using square brackets, ``[]``, and values are separated using commas.

In [None]:
l=[]
type(l)

In [None]:
x=[1,2,3,4,[1,2,3,4],'jordi']
print x[4:], x[0], x[5], x[:4]

In [None]:
x[-2:]  # Indices can also be negative: they count from the end of the list,
        # so this selects the last two elements.

Lists can be nested to represent several dimensions, and indexing can be chained to reach the inner elements:

In [None]:
x = [[1,2,3,4], [5,6,7,8]]
print x[0], x[0][0]

### Help: Python Tutorial

In [None]:
from IPython.display import HTML
HTML('<iframe src=http://docs.python.org/3/tutorial/index.html?useformat=mobile width=780 height=350></iframe>')

### Conditionals

The conditional structure in Python is <code>if</code>. It is usually combined with relational operators: <code> <, <=, ==, >=, >, != </code>.

In [None]:
def main(celsius):
    fahrenheit = 9.0 /5.0 * celsius + 32
    print "The temperature in Fahrenheit is", fahrenheit
    if fahrenheit > 90:
        print "It's really hot out there."
    elif fahrenheit < 30:
        print "It's really cold out there."
    else: pass
        
main(35)

``if`` statements can be combined with loops (``for``, ``while``):

In [None]:
numbers = [-5, 3,2,-1,9,6]
total = 0
for n in numbers:
    if n >= 0:
        total += n
print total

In [None]:
def average(a):
    sum = 0.0
    for i in a:
        sum = sum + i
    return sum/len(a)

average([1,2,3,4])

In [None]:
def sumdif(x,y):
    sum, dif = x+y, x-y
    return sum, dif

a, b = sumdif(2,2)
print a, b

In [None]:
def main(n):
    cont = 0
    while (int(n) > 0):
        cont += 1
        n = n/2
#        print n
    return cont-1

main(10)
# main(10.3)

### Boolean operators.

In [None]:
a = 4
b = 40
(a>2) and (b>30)

In [None]:
(a>2) or (b>100)

In [None]:
not(a>2)

### Data Collections

We need to represent data collections: words in a text, students in a course, experimental data, etc., or to store intermediate results. The simplest data collection is the <code>list</code> (an ordered sequence of objects):

In [None]:
range(10)

In [None]:
import string
b = string.split("This is an example")
print b

Lists are *mutable, dynamic and non-homogeneous* objects:

In [None]:
a = [1,2,3,4]
a[1] = 1
print a

In [None]:
c = a + b
print c

In [None]:
zeroes = [0] * 10
del zeroes[5:]
print zeroes

In [None]:
zeroes.append(1)
print zeroes

In [None]:
zeroes.remove(1)
print zeroes

In [None]:
if 1 in zeroes:
    print False
else: 
    print True

#### References

We can inspect the reference of an object:

In [None]:
a ='hello'
print id(a)

Two different objects:

In [None]:
a = [1,2,3]
b = [1,2,3]
print id(a), id(b)

Object alias:

In [None]:
a = [1,2,3]
b = a                     # alias
print id(a), id(b)

Cloning:

In [None]:
a = [1,2,3]
b = a[:]                  # cloning with :

print a, b, b[1:], id(a), id(b), id(b[1:])
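
One caveat worth keeping in mind is that cloning with ``[:]`` makes a *shallow* copy: the outer list is new, but nested lists are still shared. The sketch below contrasts this with ``copy.deepcopy`` from the standard library; the variable names are ours.

```python
import copy

nested = [[1, 2], [3, 4]]
shallow = nested[:]              # new outer list, same inner lists
deep = copy.deepcopy(nested)     # new outer list and new inner lists

nested[0][0] = 99
print(shallow[0][0])   # 99: the inner list is shared
print(deep[0][0])      # 1: the deep copy is independent
```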

When a list is an argument of a function, we are sending the *reference*, not a *copy*

In [None]:
def head(list):
    return list[0]

numbers=[1,2,3,4]
print head(numbers), numbers

In [None]:
def change_first_element(list):
    list[0]=0
    
numbers=[1,2,3,4]
change_first_element(numbers)
print numbers

If we return a list we are returning a reference:

In [None]:
def tail(list):
    return list[1:]     # we are creating a new list

numbers=[1,2,3,4]
rest = tail(numbers)
print rest, numbers
print id(rest), id(numbers)

In [None]:
# Press tab
numbers.

Sometimes it is important to perform a *sanity check* on what a predefined function actually does:

In [None]:
numbers=[1,2,3,4]
def test(l):
    return l.reverse()

test(numbers)
print numbers, numbers.reverse(), numbers
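
The point of this check is that ``list.reverse()`` reverses the list in place and returns ``None``. The sketch below shows this, together with two ways of getting a reversed copy that leave the original list untouched; the variable names are ours.

```python
numbers = [1, 2, 3, 4]

returned = numbers.reverse()      # reverses in place
print(returned)                   # None
print(numbers)                    # [4, 3, 2, 1]

# A negative stride or reversed() builds a new list instead.
original = [1, 2, 3, 4]
by_slice = original[::-1]
by_builtin = list(reversed(original))
print(by_slice)                   # [4, 3, 2, 1]
print(original)                   # [1, 2, 3, 4]
```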

### Dictionaries

A dictionary is a collection that allows the access of an *element* by using a *key*:

In [None]:
dict = {"d": "D", "b":"B", "c":"C"}
dict["d"]

Dictionaries are **mutable** and **unordered**:

In [None]:
dict["a"]="A"
dict

In [None]:
dict.has_key("a")

In [None]:
del dict["a"]
print dict

In [None]:
dict = {"d": "D", "b":"B", "c":"C"}
dict.items()
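
Two remarks on the cells above, followed by a sketch. The method ``has_key`` was removed in Python 3, while the ``in`` operator works in both Python 2 and Python 3. Also, calling a variable ``dict`` hides the built-in type of the same name, so the sketch uses the name ``letters``, which is our choice.

```python
letters = {"d": "D", "b": "B", "c": "C"}

has_a = "a" in letters                    # membership test on the keys
fallback = letters.get("a", "missing")    # get() avoids a KeyError
print(has_a)        # False
print(fallback)     # missing

# Sorting the keys gives a predictable order for display.
for key in sorted(letters):
    print("{} -> {}".format(key, letters[key]))
```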

### Tuples

Tuples are **immutable** lists:

In [None]:
tup = ('a', 'b', 'c')
print type(tup), tup[1:3]

In [None]:
tup[0]='d'

**Example**: How to compute the statistics of words in a document.

In [None]:
# We can read a file in a list of line strings or as a string
# -*- coding: latin-1 -*-

text = open("files/text.txt",'r').readlines()
for l in text:
    print l[:-1]  # The last character is a 'new line'

In [None]:
import string

def compare((w1,c1),(w2,c2)):
# A sorting function returns negative if x<y, zero if x==y, positive if x>y.
    if c1 < c2:
        return -1
    elif c1 == c2:
        return 0
    else:
        return 1
    
def main():
    
# We read the file in a string
    text = open("files/text.txt",'r').read()
    text = string.lower(text)
    for ch in '!"#$%&/()=?¿^*`+¨[]{}-:;,.':
        text = string.replace(text, ch, ' ')
# We build a list with all words (blank separation)
    words = string.split(text)

# We use a dictionary for counting words
    counts = {}
    for w in words:
        try:
            counts[w] = counts[w] + 1
        except KeyError:
            counts[w] = 1
            
# We create a sorted list
    items = counts.items()
    items.sort(compare)
    
# We print the list
    for i in range(len(items)):
        print "%-2s%3d" % items[i],'|' ,

main()    
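
As an alternative to the hand-written counting and sorting above, the standard library offers ``collections.Counter`` (Python 2.7 and later). The sketch below is a different technique from the one in the example, not a rewrite of it: it counts words in a short in-memory string that stands in for the file contents, and it lists the most frequent words first.

```python
from collections import Counter

text = "the cat and the hat and the bat"   # stand-in for the file contents
counts = Counter(text.lower().split())

# most_common() sorts by decreasing count.
for word, count in counts.most_common(2):
    print("{:<4}{:3d}".format(word, count))
```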

### Lists (and dictionary) comprehensions

List comprehensions are a way to fit a ``for`` loop, an ``if`` statement, and an assignment all in one line.

A list comprehension consists of the following parts:

+ An input sequence.
+ A variable representing members of the input sequence.
+ An optional predicate expression.
+ An output expression producing elements of the output list from members of the input sequence that satisfy the predicate.

In [None]:
num = [1, 4, -5, 10, -7, 2, 3, -1]
squared = [ x**2 for x in num if x > 0]
print type(squared), squared
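
The same four parts apply to dictionary comprehensions, which use braces and a ``key: value`` output expression. A minimal sketch, with ``squares_by_value`` as a name chosen here:

```python
num = [1, 4, -5, 10, -7, 2, 3, -1]
squares_by_value = {x: x**2 for x in num if x > 0}
print(squares_by_value[10])    # 100
print(len(squares_by_value))   # 5
```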

<div class = "alert alert-success">  There is a downside to list comprehensions: the entire list has to be stored in memory at once. This isn’t a problem for small lists like the ones in the above examples, or even for lists several orders of magnitude larger. But we can use **<font color="red">generators</font>** to solve this problem.</div>



Generator expressions do not load the whole list into memory at once, but instead create a *generator object* so only one list element has to be loaded at any time.

Generator expressions have the same syntax as list comprehensions, but with parentheses around the outside instead of brackets:

In [None]:
num = [1, 4, -5, 10, -7, 2, 3, -1]
squared = ( x**2 for x in num if x > 0 )
print type(squared), squared
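
To get a rough feeling for the memory difference, the sketch below compares the size of a list with the size of the equivalent generator object using ``sys.getsizeof``. Note that ``getsizeof`` only measures the container itself, not the integers it refers to, so this is an indication rather than an exact measurement. The size ``100000`` and the variable names are arbitrary choices.

```python
import sys

size = 100000
as_list = [x**2 for x in range(size)]
as_generator = (x**2 for x in range(size))

list_is_bigger = sys.getsizeof(as_list) > sys.getsizeof(as_generator)
print(list_is_bigger)        # True

# The generator still produces every element, one at a time.
total = sum(as_generator)
print(total == sum(as_list)) # True
```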

The elements of the generator must be accessed by an iterator because they are generated when needed:

In [None]:
lis = []
for item in squared:
    lis = lis + [item]
print lis

We can define our own generators with the ``yield`` statement. For example, let's build a generator for the binary representation of a number between 0 and 1 with arbitrary precision.

In [None]:
# binary representation of a number between 0 and 1 (b bits precision).

def res(n,b):
    bin_a = '.'
    for i in range(b):
        n *= 2
        bin_a +=  str(int(n))
        n = n % 1
    return bin_a

print res(1/3.0,10)

In [None]:
# binary representation of a number between 0 and 1 (precision as needed).

def binRep(n):
    while True:
        n *= 2
        yield int(n)
        n = n % 1

a = binRep(1/3.) 
a_bin = '.'
for i in range(50):
    a_bin +=  str(a.next())
    
print a_bin
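
The method call ``a.next()`` is specific to Python 2. The built-in function ``next()`` works in Python 2.6 and later as well as in Python 3, and ``itertools.islice`` takes a fixed number of items from a generator that never ends. The sketch below restates the generator under the name ``bin_rep`` (our choice) and uses 0.625, whose binary expansion 0.101 is exact.

```python
from itertools import islice

def bin_rep(n):
    """Yield the binary digits of a number between 0 and 1, one at a time."""
    while True:
        n *= 2
        yield int(n)
        n = n % 1

digits = bin_rep(0.625)          # 0.625 is 0.101 in binary
first = next(digits)             # portable replacement for digits.next()
rest = list(islice(digits, 4))   # take the next four digits lazily
print(first)    # 1
print(rest)     # [0, 1, 0, 0]
```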

**Exercise**

+ Write a function with two parameters <code>a</code> and <code>b</code>, to compute the final amount we get if we deposit 1000€ for <code>a</code> years in a bank account with an interest rate of <code>b</code> per cent.
+ What is the result for ``a``=10, ``b``=10?

In [None]:
# Your solution here

**Exercise**

+ Write a function with one parameter <code>a</code>, to compute the minimum period we need to double the amount in an account with an interest rate of <code>a</code> per cent.
+ What is the result for ``a``=3?

In [None]:
# Your solution here

**Exercise**

> (...) In mathematics, the sieve of Eratosthenes, one of a number of prime number sieves, is a simple, ancient algorithm for finding all prime numbers up to any given limit. A prime number is a natural number which has exactly two distinct natural number divisors: 1 and itself. To find all the prime numbers less than or equal to a given integer $n$ by Eratosthenes' method.
> (Source: *Wikipedia*)

+ Write a program to implement the Eratosthenes algorithm. First create a list of consecutive integers from 2 through $n$: (2, 3, 4, ..., n). Then,    
    - Initially, let $p$ equal 2, the first prime number.
    - Starting from $p$, enumerate its multiples by counting to $n$ in increments of $p$, and mark them in the list (these will be 2p, 3p, 4p, etc.; the $p$ itself should not be marked).
    - Find the first number greater than $p$ in the list that is not marked. If there was no such number, stop. Otherwise, let $p$ now equal this new number (which is the next prime), and repeat from step 2.
    - When the algorithm terminates, all the numbers in the list that are not marked are prime. (...)

In [None]:
# Your solution here

**Exercise:** 

+ Compute the set of bigrams of a string. (``'hola'->'ho'+'ol'+'la'``) 

In [None]:
# Your solution here

**Exercise:** 

+ Compute the set of substrings of a string. 

In [None]:
# Your solution here

**Exercise:** 

+ Take two lists, say for example these two:

	``a = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]  b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]``
    
  and write a program that returns a list that contains only the elements that are common between the lists (without duplicates).

In [None]:
# Your solution here

### Objects

You can define your own classes and objects.

In [None]:
#creating a class

class Rectangle:
    def __init__(self,x,y):
        self.x = x
        self.y = y
    description = "This shape has not been described yet"
    author = "Nobody has claimed to make this shape yet"
    def area(self):
        return self.x * self.y
    def perimeter(self):
        return 2 * self.x + 2 * self.y
    def describe(self,text):
        self.description = text
    def authorName(self,text):
        self.author = text
    def scaleSize(self,scale):
        self.x = self.x * scale
        self.y = self.y * scale

#creating objects
a = Rectangle(100, 45)
b = Rectangle(10,230)

#describing the rectangles
a.describe("A fat rectangle")
b.describe("A thin rectangle")

In [None]:
#finding the area of your rectangle:
print a.area()
 
#finding the perimeter of your rectangle:
print a.perimeter()

#getting the description
print a.description
print a.author

In [None]:
#finding the area of your rectangle:
print b.area()
print b.description

#making the rectangle 50% smaller
b.scaleSize(0.5)
b.describe("A small thin rectangle")
 
#re-printing the new area of the rectangle
print b.area()
print b.description

### Exercise

Build a simple class called ``Polynomial`` for representing and manipulating polynomial functions such as

$$ p(x) = a_0 + a_1 x + \dots + a_n x^n $$

The instance data for the class ``Polynomial`` will be the coefficients ($a_0, \dots, a_n$). Provide methods that:

+ Evaluate the polynomial, returning $p(x)$ for any $x$
+ Differentiate the polynomial, replacing the original coefficients with those of its derivative $p′$.

In [None]:
# Your solution

class Polynomial(object):
    #your code
    pass
        
a = Polynomial([2,4])
print a.eval(1)
a.differentiate()
print a.eval(1)

## Reading Data

The most convenient method that you can use to work with data is to load it directly into memory.

When you load a file (of any type), the entire dataset is available at all times and the loading process is quite direct:

In [None]:
with open("files/SeaIce.txt", 'r') as input_file:
    print 'File content', input_file.read()

The entire dataset is loaded from the file into free memory. Of course, the loading process will fail if your system lacks sufficient memory to hold the dataset. When this problem occurs, you need to consider other techniques for working with the dataset, such as **streaming** it or **sampling** it.

Here’s an example of how you can stream data using Python:

In [None]:
with open("files/SeaIce.txt", 'r') as input_file:
    for observation in input_file:
        print 'Reading Data: ' + observation,

The ``input_file`` file object contains a pointer to the open file. As the code performs data reads in the for loop, the file pointer moves to the next record.

Data streaming obtains all the records from a data source. You may find that
you don’t need all the records. You can save time and resources by simply
sampling the data. To this end we will use the ``enumerate`` function.

In [None]:
n = 17
with open("files/SeaIce.txt", 'r') as input_file:
    for j, observation in enumerate(input_file):
        if j % n==0:
            print('Reading Line: ' + str(j) + ' Content: ' + observation), 

You can perform random sampling as well.

In [None]:
import random
sample_size = 0.01
with open("files/SeaIce.txt", 'r') as input_file:
    for j, observation in enumerate(input_file):
        if random.random()<=sample_size:
            print('Reading Line: ' + str(j) + ' Content: ' + observation), 
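
Random sampling gives a different subset on every run. If a repeatable sample is needed, one option is to use a seeded ``random.Random`` instance. The sketch below assumes nothing about ``SeaIce.txt``: it uses an in-memory ``io.StringIO`` object as a stand-in for the open file, and the seed ``42`` and the names ``fake_file``, ``rng`` and ``sample`` are arbitrary choices.

```python
import io
import random

# Stand-in for an open text file: 1000 short lines held in memory.
fake_file = io.StringIO(u"".join(u"record {}\n".format(i) for i in range(1000)))

rng = random.Random(42)      # a fixed seed makes the sample repeatable
sample_size = 0.01
sample = [(j, line.rstrip()) for j, line in enumerate(fake_file)
          if rng.random() <= sample_size]

print(len(sample))           # about 10 on average, not necessarily exactly 10
```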

### Exercise

Consider the polynomial

$$ p(x) = a_0 + a_1 x + \dots + a_n x^n $$

Write a function ``p`` such that ``p(x, coeff)`` computes the value in the polynomial given a point ``x`` and a list of coefficients ``coeff``. Try to use ``enumerate()`` in your loop.

In [None]:
# Your solution

### Reading from a CSV file

A flat file presents the easiest kind of file to work with. 

A problem with using native Python techniques is that the input isn’t intelligent. For example, when a file contains a header, Python simply reads it as yet more data to process, rather than as a header (not a problem for Pandas!).

The least formatted and therefore easiest‐to‐read flat‐file format is the text file. However, a text file also treats all data as strings, so you often have to convert numeric data into other forms.

A comma‐separated value (CSV) file provides more formatting and more information, but it requires a little more effort to read.

At the high end of flat‐file formatting are custom data formats, such as an Excel file, which contains extensive formatting and could include multiple datasets in a single file.

A CSV file provides more formatting than a simple text file. In fact, CSV files can become quite complicated. There is a standard that defines the format of CSV files, and you can see it at https://tools.ietf.org/html/rfc4180.

The ``csv`` module is useful for working with data exported from spreadsheets and databases into text files formatted with fields and records, commonly referred to as comma-separated value (CSV).

In [None]:
import csv

f = open("files/Advertising.csv", 'r')
try:
    reader = csv.reader(f)
    for row in reader:
        print row
finally:
    f.close()

In [None]:
with open("files/Advertising.csv", 'r') as input_file:
    reader = csv.reader(input_file)
    for row in reader:
        print row

When you have data to be imported into some other application, writing ``csv`` files is just as easy as reading them. 

Use ``writer()`` to create an object for writing, then iterate over the rows, using ``writerow()`` to write them.  

In [None]:
import csv

ifile  = open("files/Advertising.csv", 'r')
reader = csv.reader(ifile)

ofile  = open('test.csv', "w")
writer = csv.writer(ofile, delimiter=',', lineterminator='\n')

for row in reader:
    writer.writerow(row)

ifile.close()
ofile.close()

A more elegant way of reading a CSV file:

+  Wrap the CSV reader in a function that returns a generator
+  Use context managers ``with [callable] as [name]`` to ensure that the handle to the file is closed automatically.
+  Use the ``csv.DictReader`` class when headers are present (otherwise just use ``csv.reader``)

In [None]:
import csv

ADV = 'files/Advertising.csv'

def read_data(path):
    with open(path, 'r') as data:
        reader = csv.DictReader(data)
        for row in reader:
            yield row

for idx, row in enumerate(read_data(ADV)):
    if idx < 10: print row
    else: break

The file is not opened, read, or parsed until you need it. This is powerful because it means that even for much larger data sets you will have efficient, portable code. 

In [None]:
data = read_data(ADV)
print data

In [None]:
for c in read_data(ADV):
    print c
    break
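
The laziness can be checked directly: calling a generator function does not run its body, so ``open()`` is only executed when the first row is requested. The sketch below restates ``read_data`` and calls it with a hypothetical path, ``no_such_file.csv``, that is assumed not to exist; the error only appears at the first ``next()``.

```python
import csv

def read_data(path):
    """Yield one dictionary per CSV row; the file is opened on first use."""
    with open(path, 'r') as data:
        for row in csv.DictReader(data):
            yield row

rows = read_data('no_such_file.csv')   # hypothetical path; nothing is opened yet
try:
    next(rows)                         # only now is open() called
    opened = True
except (IOError, OSError):
    opened = False
print(opened)    # False, assuming no file with that name exists
```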

# Python goodies

## Accessing documentation

Every Python object contains the reference to a string, known as a ``doc string``, which in most cases will contain a concise
summary of the object and how to use it.

In [None]:
help(sum)

In [None]:
L = []
help(L)

## Progress Bar

In [None]:
!pip install tqdm 

In [None]:
import tqdm
from time import sleep
text = ""
for char in tqdm.tqdm(range(1000)):
    sleep(0.01)

## Decorators

Everything in Python is an object that can be treated like a value (e.g. functions, classes, modules). You can pass them as arguments to functions, and return them from functions:

In [None]:
def is_even(value):
    """Return True if *value* is even."""
    return (value % 2) == 0

def count_occurrences(target_list, predicate):
    """Return the number of times applying the function *predicate* to a
    list element returns True."""
    return sum([1 for e in target_list if predicate(e)])

#my_predicate = is_even
#my_list = [2, 4, 6, 7, 9, 11]
#result = count_occurrences(my_list, my_predicate)

result = count_occurrences([2, 4, 6, 7, 9, 11], is_even)
print(result)

The magic is in the line ``my_predicate = is_even`` (commented out above). It binds the name ``my_predicate`` to the function itself (not the value returned when calling it), so the name can be used like any "normal" variable.

Functions can also be returned from functions as the return value:

In [None]:
def surround_with(surrounding):
    """Return a function that takes a single argument and returns it
    surrounded by *surrounding*."""
    def surround_with_value(word):
        return '{}{}{}'.format(surrounding, word, surrounding)
    return surround_with_value

def transform_words(content, targets, transform):
    """Return a string based on *content* but with each occurrence 
    of words in *targets* replaced with
    the result of applying *transform* to it."""
    result = ''
    for word in content.split():
        if word in targets:
            result += ' {}'.format(transform(word))
        else:
            result += ' {}'.format(word)
    return result

markdown_string = 'My name is Jeff Knupp and I like Python but I do not own a Python'
markdown_string_italicized = transform_words(markdown_string, ['Python', 'Jeff'], surround_with('*'))
print(markdown_string_italicized)

We've now seen that functions can both be sent as arguments to a function and returned as the result of a function. What if we made use of both of those facts together? Can we create a function that takes a function as a parameter and returns a function as the result? Would that be useful?



In [None]:
def fib(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n-1) + fib(n-2)

%timeit fib(20)

In [None]:
def memoize(f):
    memo = {}
    def helper(x):
        if x not in memo:            
            memo[x] = f(x)
        return memo[x]
    return helper
    
@memoize
def fib(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n-1) + fib(n-2)

%timeit fib(20)

## Lambda, filter, reduce and map

The ``lambda`` operator or lambda function is a way to create small anonymous functions, i.e. functions without a name. These functions are throw-away functions, i.e. they are just needed where they have been created. Lambda functions are mainly used in combination with the functions ``filter(), map()`` and ``reduce()``. 

In [None]:
f = lambda x, y : x + y
f(1,1)

The advantage of the lambda operator can be seen when it is used in combination with the ``map()`` function. 
``map()`` is a function with two arguments:

``r = map(func, seq)``

``map()`` applies the function ``func`` to all the elements of the sequence ``seq``. 

It returns a new list with the elements changed by ``func``.
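The statement that ``map()`` returns a list holds for Python 2, which this notebook uses. In Python 3, ``map()`` returns a lazy iterator instead, so the result has to be wrapped in ``list()`` before it can be printed or reused. A small sketch that behaves the same in both versions (``double`` is a hypothetical helper):

```python
def double(x):
    return 2 * x

numbers = [1, 2, 3, 4]
doubled = list(map(double, numbers))    # list() forces evaluation on Python 3
print(doubled)                          # [2, 4, 6, 8]
```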

In [None]:
def fahrenheit(T):
    return ((float(9)/5)*T + 32)
def celsius(T):
    return (float(5)/9)*(T-32)
temp = (36.5, 37, 37.5,39)

F = map(fahrenheit, temp)
C = map(celsius, F)
print F
print C

By using lambda, we wouldn't have had to define and name the functions ``fahrenheit()`` and ``celsius()``:

In [None]:
Celsius = [39.2, 36.5, 37.3, 37.8]
Fahrenheit = map(lambda x: (float(9)/5)*x + 32, Celsius)
print Fahrenheit
C = map(lambda x: (float(5)/9)*(x-32), Fahrenheit)
print C

The function ``filter(function, list)`` offers an elegant way to keep only those elements of a list for which the function ``function`` returns ``True``.

The function ``filter(f, l)`` needs a function ``f`` as its first argument. ``f`` returns a Boolean value, i.e. either ``True`` or ``False``. This function is applied to every element of the list ``l``, and an element is included in the result list only if ``f`` returns ``True`` for it.

In [None]:
fib = [0,1,1,2,3,5,8,13,21,34,55]
result = filter(lambda x: x % 2, fib)
print result
result = filter(lambda x: x % 2 == 0, fib)
print result

The function ``reduce(func, seq)`` continually applies the function ``func()`` to the sequence ``seq``. It returns a single value. 

If ``seq = [ s1, s2, s3, ... , sn ]``, calling ``reduce(func, seq)`` works like this:
+ First, ``func`` is applied to the first two elements of ``seq``, i.e. ``func(s1, s2)``. The list on which ``reduce()`` works now looks like this: ``[ func(s1, s2), s3, ... , sn ]``
+ In the next step, ``func`` is applied to the previous result and the third element of the list, i.e. ``func(func(s1, s2), s3)``. The list now looks like this: ``[ func(func(s1, s2), s3), ... , sn ]``
+ This continues until just one element is left, which is returned as the result of ``reduce()``.
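The steps above can be made visible by recording each call that ``reduce()`` makes. This is only a sketch: ``traced_add`` and ``steps`` are hypothetical names, and the import is needed because ``reduce`` is a built-in in Python 2 but lives in ``functools`` in Python 3 (the import works in both).

```python
from functools import reduce

steps = []

def traced_add(x, y):
    """Add two numbers and record the pair of arguments received."""
    steps.append((x, y))
    return x + y

total = reduce(traced_add, [47, 11, 42, 13])
print(steps)    # [(47, 11), (58, 42), (100, 13)]
print(total)    # 113
```

Each recorded pair consists of the running result and the next element of the sequence.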

In [None]:
reduce(lambda x,y: x+y, [47,11,42,13])

### Exercise

1. Determine the maximum of a list of numerical values by using ``reduce``.
2. Calculate the sum of the numbers from 1 to 100 by using ``reduce``.

In [None]:
# Your solution here

## Serialization

The standard serialization package in Python is ``pickle``. The ``pickle`` module implements an algorithm for turning most Python objects into a series of bytes. This process is also called **serializing** the object. 

The byte stream representing the object can then be transmitted or stored, and later reconstructed to create a new object with the same characteristics.
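As a minimal sketch of this round trip, the following serializes a dictionary to a byte string and rebuilds it. The variable names are illustrative only. The reconstructed object is equal to the original but is a separate object in memory.

```python
import pickle

address_book = {'The White House': '1600 Pennsylvania Avenue NW, Washington, DC 20500'}

payload = pickle.dumps(address_book)    # serialize to a byte string
restored = pickle.loads(payload)        # rebuild a new object from the bytes

print(restored == address_book)         # True: same contents
print(restored is address_book)         # False: a distinct object
```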

In [None]:
import pickle
pickle.dumps({'The White House': "1600 Pennsylvania Avenue NW, Washington, DC 20500"})

In [None]:
pickle.loads("(dp0\nS'The White House'\np1\nS'1600 Pennsylvania Avenue NW, Washington, DC 20500'\np2\ns.")

In [None]:
favorite_color = { "lion": "yellow", "kitty": "red" }
myfile = open( "save.pkl", "wb" )
pickle.dump(favorite_color, myfile)
myfile.close()

In [None]:
myfile = open( "save.pkl", "rb" )
favorite_color = pickle.load( myfile )
myfile.close()
favorite_color

## Timing Code and other magic commands

In [None]:
%reset

In [None]:
import numpy as np

def p1(x, coef):
    return sum(a * x**i for i, a in enumerate(coef))

def p2(x, coef):
    X = np.empty(len(coef))
    X[0] = 1
    X[1:] = x
    y = np.cumprod(X)   # y = [1, x, x**2,...]
    return np.dot(coef, y)

In [None]:
coef = np.random.randn(1000)
%timeit p1(0.9, coef)

In [None]:
%timeit p2(0.9, coef)

In [None]:
%whos

In [None]:
%%writefile test.py
import numpy as np
def p1(x, coef):
    return sum(a * x**i for i, a in enumerate(coef))

coef = np.random.randn(1000)
print p1(0.9, coef)

In [None]:
%run test.py

In [None]:
# %load test.py
import numpy as np
def p1(x, coef):
    return sum(a * x**i for i, a in enumerate(coef))

coef = np.random.randn(1000)
print p1(0.9, coef)

# Advanced programming for data analysis: NumPy.


<div class = "alert alert-success">  **NumPy** is an open-source add-on module to Python that provides common mathematical and numerical routines in pre-compiled, fast functions.</div>

There are several ways to import NumPy. The standard approach is: 

In [None]:
import numpy as np

The central feature of NumPy is the **array** object class. Arrays are similar to lists in Python, except that every element of an array must be of the same type, typically a numeric type like float or int. 

Arrays make operations with large amounts of numeric data very fast and are generally much more efficient than lists. 
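A short sketch of the two differences just described, using made-up values: mixed inputs are converted to a single element type, and arithmetic acts on every element rather than on the container.

```python
import numpy as np

values = [1, 2, 3.5]
arr = np.array(values)          # the ints are upcast so all elements share one type

print(arr.dtype)                # float64
print((arr * 2).tolist())       # [2.0, 4.0, 7.0]: applied element by element
print(values * 2)               # [1, 2, 3.5, 1, 2, 3.5]: a list is repeated instead
```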

### Array creation

There are several ways to create an array:

+ Explicitly from a list of values.
+ As a range of values.
+ As a random array.
+ By specifying the number of elements.
+ By creating an uninitialized, zero-initialized, or one-initialized array.
+ By creating a constant diagonal value array.
+ Etc.



In [None]:
a = np.array([1, 2, 4, 8, 16], float) 
print a[:2], len(a) 

In [None]:
print np.arange(5, dtype=float)  
print np.ones((2,3), dtype=float)  
print np.zeros(7, dtype=int)
a= np.array([[1, 2, 3], [4, 5, 6]], float) 
print np.zeros_like(a) 
print np.identity(4, dtype=str) 

In [None]:
a = np.linspace(0,100,10)
print a

In [None]:
a = np.linspace(10,100,10)
print a

In [None]:
np.random.rand(2,3) 

In [None]:
a = np.random.normal(size=5) 
b = np.random.normal(5, size=5)         # mean = 5
print a
print b

In [None]:
np.random.normal?

Arrays can be multidimensional. **Unlike lists**, different axes are accessed using commas inside 
bracket notation.

In [None]:
a = np.array([[1, 2, 3], [4, 5, 6]], int)
print a[0,0]                # different axes are accessed using commas inside bracket notation
print a[-1:,-2:]

Array objects have several properties:

In [None]:
print a.shape, a.dtype

The ``in`` operator can be used to test if values are present in an array: 


In [None]:
2 in a 

Arrays can be reshaped using tuples that specify new dimensions. 

In [None]:
a = a.reshape((3, 2))    # reshape returns a new array object (a view of the same data when possible)
print a

Lists can also be created from arrays: 

In [None]:
a = np.array([1, 2, 3], float) 
type(a.tolist())

Other functions and properties:

In [None]:
a = np.array(range(6), float).reshape((2, 3)) 
a.transpose()   # Transposed versions of arrays

In [None]:
a.flatten()    # One-dimensional versions of multi-dimensional arrays

In [None]:
a = np.array([1,2], float) 
b = np.array([3,4,5,6], float) 
c = np.array([7,8,9], float) 
np.concatenate((a, b, c))          # Concatenation

In [None]:
a = np.array([[1, 2], [3, 4]], float) 
b = np.array([[5, 6], [7,8]], float) 
np.concatenate((a,b), axis=0)      # it is possible to specify the axis for concatenation

In [None]:
np.concatenate((a,b), axis=1)

In [None]:
a = np.array([2, 4, 3], float) 
print a.sum(), a.prod(), a.mean(), a.var(), a.std(), a.min(), a.max()
print a.argmin(), a.argmax()     # these functions return the array indices 

One dimensional arrays have a 1-tuple for their shape: <code> (size,)</code>.

N-dimensional arrays have an N-tuple for their shape: <code> (size1, ..., sizeN)</code>.

For multidimensional arrays, each of the functions thus far described can take an optional 
argument ``axis`` that will perform an operation along only the specified axis, placing the results in a return array. 

In [None]:
a = np.array([[0, 2], [3, -1], [3, 5]], float) 
print a
print a.mean(axis=0), a.mean(axis=1), a.min(axis=1), a.max(axis=0) 

<div class="alert alert-info">`ipythonblocks` is a teaching tool that allows students to experiment with Python flow control concepts and immediately see the effects of their code represented in a colorful, attractive way. BlockGrid objects can be **indexed and sliced like 2D NumPy arrays** making them good practice for learning how to access arrays. </div>


In [None]:
import os
os.chdir('./modules/')
from ipythonblocks import BlockGrid
os.chdir('..')
os.getcwd()

In [None]:
grid = BlockGrid(8, 8, fill=(123, 234, 123))
grid

In [None]:
grid[0, 0] = (0, 0, 0)
grid[0, 2] = (255, 0, 0)
grid[0, 4] = (255, 255, 255)
grid[0, 6] = (0, 150, 150)
grid.show()

In [None]:
from ipythonblocks import colors
grid[1, 1] = colors['Teal']
grid[1, 2] = colors['Thistle']
grid[1, 3] = colors['Peru']
grid.show()

In [None]:
a = grid[0, 0]
a

In [None]:
print a

In [None]:
print grid[0,0] == grid[0,1]    
print grid[0,1] == grid[0,3]

### Indexing and Slicing

Array views contain a pointer to the original data, but may have different
shape or stride values.

<div class="alert alert-success"> **Simple assignments do not make copies of arrays. Slicing operations do not make copies either; they return views on the original array**.</div>

In [None]:
a = np.array(range(64)).reshape((8,8))
print a
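The view semantics described at the start of this section can be checked directly. The sketch below uses its own small array (``base`` and the other names are hypothetical) so that it does not disturb ``a``. ``np.shares_memory`` reports whether two arrays overlap in memory.

```python
import numpy as np

base = np.arange(6)
alias = base                    # plain assignment: a second name for the same array
view = base[1:4]                # slicing: a view onto the same data
independent = base[1:4].copy()  # an explicit, independent copy

view[0] = 100                   # writes through to base (and therefore to alias)
independent[1] = -1             # leaves base untouched

print(base.tolist())            # [0, 100, 2, 3, 4, 5]
print(alias is base)            # True
print(np.shares_memory(base, view), np.shares_memory(base, independent))
```

When an independent array is needed, calling ``.copy()`` on the slice is the usual way to get one.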

In [None]:
print a[0:2,:] 

In [None]:
grid = BlockGrid(8, 8, fill=(123, 234, 123))
grid[0:2,:] = colors['Teal']
grid.show()

In [None]:
print a[2,1:]  # A 1-D slice of row 2 (columns 1 onwards), not a single element

In [None]:
grid[2,1:] = colors['Teal']
grid.show()

In [None]:
print a[0,1:] # A 1-D slice of row 0 (columns 1 onwards), not a single element

In [None]:
print a[:2,2:3]

In [None]:
grid[:2,2:3] = colors['Peru']
grid.show()

In [None]:
print a[:,::2]

In [None]:
grid[:,::2] = colors['Peru']
grid.show()

In [None]:
print a[::2,::3]

In [None]:
grid[::2,::3] = (255, 0, 0)
grid.show()

### Exercise

By using a grid representation:
+ Build a graphical representation of all multiples of 3 from 0 to 49 by using exclusively the ``slicing`` operator (no iterations). ``BlockGrid(50, 1, block_size=10, fill=(123, 234, 123))``
+ Build a graphical representation of the prime numbers from 0 to 4999. (Hint: Compute the list of prime numbers and map this list to the grid representation). ``BlockGrid(50, 100, block_size=10, fill=(123, 234, 123))``

In [None]:
# Your solution here

### Array operations

Standard mathematical operations are applied on an **element-by-element basis**.

In [None]:
a = np.array([1,2,3], float) 
b = np.array([5,2,6], float) 
print a + b, a % b, a ** b

For two-dimensional arrays, multiplication remains elementwise and does not correspond to 
matrix multiplication. There are special functions for matrix math. 
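To make the distinction concrete, the sketch below computes both products for the same pair of matrices. ``np.dot`` is one of the matrix functions referred to above and is covered in more detail in the section on vector and matrix mathematics.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]], float)
b = np.array([[2, 0], [1, 3]], float)

elementwise = a * b             # each entry is a[i, j] * b[i, j]
matrix = np.dot(a, b)           # rows of a combined with columns of b

print(elementwise.tolist())     # [[2.0, 0.0], [3.0, 12.0]]
print(matrix.tolist())          # [[4.0, 6.0], [10.0, 12.0]]
```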

In [None]:
a = np.array([[1,2], [3,4]], float) 
b = np.array([[2,0], [1,3]], float) 
a * b 

Arrays that do not match in the number of dimensions will be **broadcast** by NumPy 
to perform mathematical operations. This often means that the smaller array will be repeated 
as necessary to perform the operation indicated.

In [None]:
a = np.array([[1, 2], [3, 4], [5, 6]], float) 
b = np.array([-1, 3], float) 
print a , b
print 
print a + b 

Sometimes, however, we can use the ``newaxis`` constant to specify how we 
want to broadcast:

In [None]:
a = np.zeros((2,2), float) 
b = np.array([-1., 3.], float) 
print a, b
print
print a + b 
print
print a + b[np.newaxis,:] 
print
print a + b[:,np.newaxis] 


NumPy offers a large library of common mathematical functions that can be applied elementwise to arrays. Among these are the functions: <code>abs, sign, sqrt, log, log10, exp, sin, cos, tan, arcsin, arccos, arctan, sinh, cosh, tanh, arcsinh, arccosh</code>, and <code>arctanh</code>. 


In [None]:
a = np.array([1, 4, 9], float) 
np.sin(a)

It is possible to iterate over arrays in a manner similar to that of lists: 

In [None]:
a = np.array([[1, 2], [3, 4], [5, 6]], float) 
for x in a: 
    print x

In [None]:
for [x, y] in a:             # Multiple assignment can also be used with array iteration
    print x * y

In [None]:
a = np.array([6, 2, 5, -1, 0], float) 
a.sort() 
print a

### Comparison operators and value testing 
Boolean comparisons can be used to compare members elementwise on arrays of equal size. 


In [None]:
a = np.array([1, 3, 0], float) 
b = np.array([0, 3, 2], float) 
print a > b 
print a == b 
print a <= b 


Arrays can be compared to single values using broadcasting: 

In [None]:
a = np.array([1, 3, 0], float) 
a > 2

The built-in <code>any</code> and <code>all</code> functions can be used to determine whether any or all elements of a 
Boolean array are true: 

In [None]:
c = np.array([ True, False, False], bool) 
print any(c), all(c)

Compound Boolean expressions can be applied to arrays on an element-by-element basis using 
special functions ``logical_and``, ``logical_or``, and ``logical_not``. 

In [None]:
a = np.array([1, 3, 0], float) 
print np.logical_and(a > 0, a < 3) 
b = np.array([True, False, True], bool) 
print np.logical_not(b) 
c = np.array([False, True, False], bool) 
print np.logical_or(b, c) 


The ``where`` function forms a new array from two arrays of equivalent size using a Boolean filter  to choose between elements of the two. Its basic syntax is: <br>
<code>where(boolarray, truearray, falsearray)</code>

In [None]:
a = np.array([1, 3, 0], float) 
b = np.where(a != 0, 1 / a, a) 
print b
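One detail of the cell above is easy to miss: both candidate arrays are computed in full before ``where`` chooses between them, so ``1 / a`` is also evaluated at the zero entry and NumPy emits a division warning. A possible way to keep the same result while silencing the warning is the ``np.errstate`` context manager. This is a sketch, not the only option, and ``x`` and ``inverse`` are illustrative names.

```python
import numpy as np

x = np.array([1, 3, 0], float)

# 1 / x is evaluated for every element, including the zero;
# errstate suppresses the resulting divide-by-zero warning.
with np.errstate(divide='ignore'):
    inverse = np.where(x != 0, 1 / x, x)

print(inverse)
```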


### Array item selection and manipulation 
We can use array selectors to **filter** for specific subsets of elements of other arrays. 

In [None]:
a = np.array([[6, 4], [5, 9]], float) 
a >= 6

In [None]:
a[a >= 6] 

In [None]:
a = np.array([[6, 4], [5, 9]], float) 
sel = (a >= 6) 
a[sel] 

It is also possible to select using integer arrays that represent indexes.

In [None]:
a = np.array([2, 4, 6, 8], float) 
b = np.array([0, 0, 1, 3, 2, 1], int)  # the 0th, 0th, 1st, 3rd, 2nd, and 1st elements of a
a[b] 

For multidimensional arrays, we have to send multiple one-dimensional integer arrays to the 
selection bracket, one for each axis.

In [None]:
a = np.array([[1, 4], [9, 16]], float) 
b = np.array([0, 0, 1, 1, 0], int) 
c = np.array([0, 1, 1, 1, 1], int) 
a[b,c] 


### Vector and matrix mathematics

In [None]:
a = np.array([1, 2, 3], float) 
b = np.array([0, 1, 1], float) 
np.dot(a, b) 


In [None]:
a = np.array([[0, 1], [2, 3]], float) 
b = np.array([2, 3], float) 
c = np.array([[1, 1], [4, 0]], float) 
print np.dot(b, a)
print np.dot(a, c) 

It is also possible to generate inner, outer, and cross products of matrices and vectors.

In [None]:
a = np.array([1, 4, 0], float) 
b = np.array([2, 2, 1], float) 
print np.outer(a, b) 

print np.inner(a, b) 

print np.cross(a, b) 

NumPy also comes with a number of built-in routines for linear algebra calculations. 

In [None]:
a = np.array([[4, 2, 0], [9, 3, 7], [1, 2, 1]], float)
np.linalg.det(a)                 # These can be found in the sub-module linalg

In [None]:
vals, vecs = np.linalg.eig(a)
print vals
print
print vecs

In [None]:
b = np.linalg.inv(a) 
print b

### Vectorize to Avoid Unnecessary Loops!

In [None]:
import numpy as np

def pydot(a, b):
    M,N = np.shape(a)
    P,Q = np.shape(b)
    c = np.zeros((M,Q))
    for i in xrange(M):
        for j in xrange(Q):
            for k in xrange(N):
                c[i,j] += a[i,k] * b[k,j]
    return c

a = np.random.randn(100,100)
b = np.random.randn(100,100)
%timeit pydot(a,b)

In [None]:
%timeit np.dot(a,b)

In [None]:
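%%timeit import random
# "import random" on the magic line is setup code: it is executed but not timed
n = 100000
total = 0                     # named total to avoid shadowing the built-in sum()
for i in range(n):
    x = random.uniform(0, 1)
    total += x**2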

Note how ``%%`` in front of `timeit` converts this line magic into a cell magic.

In [None]:
%%timeit
n = 100000
x = np.random.uniform(0, 1, n)
np.sum(x**2)

## Exercise

Rewrite the following functions so that they are fully vectorized, and compare the running times: that is, each should consist of a sequence of NumPy operations on whole arrays, with no native Python loops.

In [None]:
def sumproducts(x, y):
    """Return the sum of x[i] * y[j] for all pairs of indices i, j.

    >>> sumproducts(np.arange(3000), np.arange(3000))
    20236502250000

    """
    result = 0
    for i in range(len(x)):
        for j in range(len(y)):
            result += x[i] * y[j]
    return result

%timeit sumproducts(np.arange(3000), np.arange(3000))

In [None]:
# Your code here

In [None]:
def countlower(x, y):
    """Return the number of pairs i, j such that x[i] < y[j].

    >>> countlower(np.arange(0, 200, 2), np.arange(40, 140))
    4500

    """
    result = 0
    for i in range(len(x)):
        for j in range(len(y)):
            if x[i] < y[j]:
                result += 1
    return result

%timeit countlower(np.arange(0, 200, 2), np.arange(40, 140))

In [None]:
# Your code here

## Exercise 

In the following table we have expression values for 4 genes at 4 time points. 

                        Gene name   4h    12h   24h   48h
                        A2M         0.12  0.08  0.06  0.02
                        FOS         0.01  0.07  0.11  0.09
                        BRCA2       0.03  0.04  0.04  0.02
                        CPOX        0.05  0.09  0.11  0.14

+ Create a single array for the data (4x4)
+ Find the mean expression value per gene
+ Find the mean expression value per time point
+ Which gene has the maximum mean expression value? (Use the ``tab`` help on an array)

In [None]:
# Your solution here

In [None]:
print genes.mean(axis=1)

In [None]:
print genes.mean(axis=0)

In [None]:
names[genes.mean(axis=1).argmax()]   # gene with the highest mean expression

### Exercise

Consider the polynomial

$$ p(x) = a_0 + a_1 x + \dots + a_n x^n $$

Earlier, you wrote a simple function ``p(x, coeff)`` to evaluate it without considering efficiency. 

Now write a new function that does the same job, but uses NumPy arrays and array operations for its computations, rather than any form of Python loop.

Hint: Use ``np.cumprod()``

In [None]:
a = np.array([1,2,3])
np.cumprod(a)

In [None]:
# Your solution here

# Advanced programming for data analysis: pandas.

pandas is a fundamental high-level building block for doing practical, real world data analysis in Python.

pandas is well suited for:

+ Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
+ Ordered and unordered (not necessarily fixed-frequency) time series data.
+ Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.


<div class="alert alert-success"> **pandas is a Python package providing fast, flexible, and expressive data structures designed to work with both relational and labeled data.**</div>


Key features:

+ Easy handling of missing data
+ Size mutability: columns can be inserted and deleted from DataFrame
+ Powerful, flexible ``group by`` functionality to perform split-apply-combine operations on data sets
+ Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
+ Intuitive merging and joining of data sets
+ Flexible reshaping and pivoting of data sets
+ Hierarchical labeling of axes
+ Robust IO tools for loading data from flat files, Excel files, databases, and HDF5
+ Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

Pandas has a lot of functionality, so we'll only be able to cover a **small** fraction of what you can do. Check out the (very readable) pandas docs if you want to learn more:

http://pandas.pydata.org/pandas-docs/stable/

### Series and DataFrames

A Series is a single vector of data with an index that labels every element in the vector. 

If we do not specify the index, a sequence of integers is assigned as index.

In [None]:
import pandas as pd                            # convention, alias 'pd'
c = pd.Series([1956, 1967, 1989, 2000])
c

Its values are stored in a NumPy array (``values``) and the index in a pandas ``Index`` object:

In [None]:
c.values

In [None]:
c.index

In [None]:
c = pd.Series([1956, 1967, 1989, 2000], index = ['a','b','c','d'])
c

In [None]:
print c['d'], c[3]

DataFrames are designed to store heterogeneous multivariate data.

In [None]:
c = pd.DataFrame({'VarA':['aa','bb'], 'VarB':[22.2,33.3]}, \
                 index = ['Case1','Case2'])
c

A DataFrame has a second index representing columns:

In [None]:
c.columns

We can access **columns** as in a dictionary or with the ``.`` notation:

In [None]:
c.VarA

In [None]:
c['VarA']

These two methods return a ``Series`` object:

In [None]:
type(c.VarA), type(c['VarA'])

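If we want to access a **row** in a DataFrame we can use its ``loc`` (label-based) and ``iloc`` (position-based) indexers. Older pandas versions offered a mixed ``ix`` indexer for this, but it has since been deprecated and removed:

In [None]:
print c.iloc[0]
print c.loc['Case1']

The ``irow`` *method* of older pandas versions, which let you grab the ith row from a DataFrame, has likewise been replaced by ``iloc``:

In [None]:
print c.iloc[0]      # equivalent to the former c.irow(0)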

### Reading tabular data from a file

The ‘pandas’ Python library provides several functions, such as <code>read_csv()</code> and <code>read_table()</code>, that allow you to access data files in tabular format on your computer as well as data stored in web repositories.

Reading in a data table is simply a matter of knowing the name (and location) of the data set.

In [None]:
import pandas as pd
import numpy as np

# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 20)

In [None]:
import pandas as pd
data = pd.read_csv("http://www.mosaic-web.org/go/datasets/swim100m.csv")
data.shape   # an attribute to see how many cases and variables there are in a data frame

In [None]:
type(data)

In [None]:
data.head()  # display the first rows

In [None]:
# Load car dataset
pd.set_option('display.notebook_repr_html', True)
auto = pd.read_csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv")
auto.tail()  # print the last lines

Tabular data generally involve variables and cases. 

In ‘pandas’ data frames, each of the variables is given a name. You can refer to the variable by name in a couple of different ways:

+ We can see the variable names in a data frame by using the columns attribute of the data frame object:

In [None]:
data.columns  # This is not a function; it is an attribute of the data frame.

+ Another way to get quick information about the variables in a data frame is with the method <code>describe()</code>:

In [None]:
data.describe()  # the output from describe() is itself a data frame.

### Reading from the clipboard

We can also read data directly from the clipboard.

In [None]:
# Copy part of a table (for example, a DataFrame shown on this page) to the clipboard, then build a new DataFrame from it.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
bedford = pd.read_clipboard()
bedford.head()

### Reading from APIs

We can read online data from public APIs such as http://www.citybik.es/

In [None]:
import pandas as pd
import seaborn as sn
d = pd.read_json('http://api.citybik.es/bicing.json')   
d.info()

In [None]:
d.tail()

In [None]:
%matplotlib inline
import matplotlib.pylab as plt
a = plt.hist(d.bikes, bins=10)

### ‘pandas’ methods and ‘numpy’ operators and other functions.

There are a lot of methods that can be applied to data:

In [None]:
data['year']

In [None]:
print data.year.count()    # number of non-NaN values
print data.year.mean()     # mean value 
print data.year.argmin()   # index location at which min is obtained
print data.year.min()      # min value
print data.year.sum()   # sum of values

It is also possible to combine ‘numpy’ operators with ‘pandas’ variables:

In [None]:
np.min(data["year"])

When you encounter a function that isn’t supported by data frames, you can use ‘numpy’ functions or the special <code>apply</code> method built into data frames.

Using the ``apply()`` method, which accepts any function (named or anonymous), we can apply that function to each value in a column.

In [None]:
c = pd.DataFrame({'VarA':['aa','bb'], 'VarB':[20,30]}, index = ['Case1','Case2'])

def f(a):
    return a**2

c.VarB = c.VarB.apply(lambda d: f(d))
c

In [None]:
data.year.apply(np.sqrt).head(10) 

Alternatively, since columns are basically just arrays, we can use built-in numpy functions directly on the columns:

In [None]:
np.sqrt(data.year).head()

### Aggregating

<code>groupby</code> is the ‘pandas’ way of grouping or aggregating data frames by columns.

You can construct statements that involve more than one column within a data frame. 

For instance, here’s a calculation of the mean year, separately for (grouping by) the different sexes:

In [None]:
data.groupby('sex')['year'].mean()

In [None]:
res = data.groupby(['sex','year'])['time'].mean()
res.tail(20)

You can iterate through the result of a ``groupby``: each iteration yields a tuple. The first item is the value of the grouping column and the second is a dataframe filtered to the rows with that value. 
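A small sketch of this iteration on made-up data follows. The ``toy`` table and its values are hypothetical and only loosely modelled on the swimming records.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({'sex': ['F', 'M', 'F', 'M'],
                    'time': [60.0, 55.0, 58.0, 53.0]})

group_means = {}
for key, group in toy.groupby('sex'):
    # key is one value of 'sex'; group is the sub-DataFrame holding those rows
    group_means[key] = group['time'].mean()

print(group_means)
```

Selecting the ``time`` column before calling ``mean()`` keeps the aggregation restricted to numeric data.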

In [None]:
data.head()

In [None]:
dic = {}
for i,j in data.groupby('sex'):
    dic[i] = j.mean()
    
print dic,

You can group by more than one column as well: the first tuple item will itself be a tuple with the value of each column.

In [None]:
auto.head()

In [None]:
dic = {}
for (i,j), k in auto.groupby(['horsepower','cylinders']):
    dic[i,j] = k.mean()
    
dic

### Adding variables

Adding a new variable to a data frame can be done similarly to accessing a variable. For instance, here is how to create a new variable in <code>data</code> that holds the <code>time</code> converted from seconds to units of minutes:

In [None]:
data['minutes'] = data.time/60. # or data['time']/60.

By default, columns get inserted at the end. 

The <code>insert</code> function is available to insert at a particular location in the columns.

In [None]:
data.insert(1, 'mins', data.time/60.)

You could also, if you want, redefine an existing variable, for instance:

In [None]:
data['time'] = data.time/60.
data.head()

The <code>loc[]</code> indexer can also be used to get a subset of the dataframe:

In [None]:
kids = pd.read_csv("http://www.mosaic-web.org/go/datasets/kidsfeet.csv")
kids.shape

In [None]:
kids.index

In [None]:
# random sample of 5 cases from this data frame
rows = np.random.choice(kids.index, 5, replace=False) 

kids.loc[rows]

The results returned by the above methods will never contain the same case more than once (because we told the function not to sample with replacement). 

In contrast, ‘re-sampling with replacement’ replaces each case after it is dealt so that it can appear more than once in the result. You wouldn’t want to do this to select from a sampling frame, but it turns out that there are valuable statistical uses for this sort of sampling with replacement.
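The difference between the two sampling modes can be sketched with ``np.random.choice``, the function used above. Here ``cases`` is a hypothetical stand-in for a data frame index such as ``kids.index``.

```python
import numpy as np

np.random.seed(0)                       # fixed seed so the sketch is repeatable
cases = np.arange(10)                   # stand-in for a data frame index

without = np.random.choice(cases, 5, replace=False)    # no case can appear twice
with_repl = np.random.choice(cases, 20, replace=True)  # repeats are allowed

print(len(set(without)))                # 5: every sampled case is distinct
print(len(set(with_repl)))              # at most 10, so some cases must repeat
```

Drawing 20 times from only 10 cases is possible only with replacement. With ``replace=False`` the same request would raise an error.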

### Dropping rows and columns.

We can delete entire rows and columns:

In [None]:
c = pd.DataFrame({'VarA':['aa','bb'], 'VarB':[20,30]}, index = ['Case1','Case2'])
c = c.drop(['Case1'])
c

In [None]:
c = pd.DataFrame({'VarA':['aa','bb'], 'VarB':[20,30]}, index = ['Case1','Case2'])
c = c.drop(['VarA'], axis=1)
c

### Handling missing values.

pandas treats the values ``NaN`` and ``None`` as missing data. 

The ``pandas.isnull`` function can be used to tell whether or not a value is missing.

In [None]:
c = pd.DataFrame({'VarA':['aa', np.nan, 'cc'], 'VarB':[20,30,np.nan], \
                  'VarC':[1234, 3456, 6789]}, index = ['Case1','Case2','Case3'])
c

In [None]:
empty = c.apply(lambda c: pd.isnull(c))
empty     # empty is a Boolean DataFrame

One option is to drop all rows with missing values using the ``dropna`` function:

In [None]:
c.dropna(subset=["VarA","VarB", "VarC"])   # Returns a dataframe copy

Another option is to use the ``fillna`` function to fill them with empty strings:

In [None]:
c = pd.DataFrame({'VarA':['aa', np.nan, 'cc'], 'VarB':[20,30,55], 'VarC':[1234, 3456, 6789]}, \
                 index = ['Case1','Case2','Case3'])
c = c.fillna("")
c

### Filtering

One of the main tasks when analyzing a dataset is to select rows by using simple operators. 

In [None]:
vc = c[c.VarC > c.VarC.mean()]
vc

By using Boolean operators we can build combined filters.

In [None]:
vc = c[(c.VarC < c.VarC.mean()) & (c.VarC >= 1000)]
vc

A filter is a ``Series`` object with Boolean values:

In [None]:
filt = c.VarC < c.VarC.mean()
filt

In [None]:
vc = c[filt]
vc

We can use the ``any`` function to see if there is any ``True`` value in a filter:

In [None]:
filt.any()

### Writing data

In [None]:
bedford.to_csv("files/bedford2.csv")

In [None]:
bedford.to_csv("files/bedford3.csv", sep="\t")

### Exercise

+ Read the titanic dataset from ``files/titanic.xls`` and inspect the first records.
+ Are there columns that have NaN values? 
+ Drop those rows with NaN values in ``age``. 
+ What was the probability of survival? Get a variable with this value.
+ What was the probability of survival for each ``pclass``? Get a variable with this value.
+ What was the mean age for third class survivors? Get a variable with this value.


In [None]:
# Your solution here

# Visualization

In [None]:
import pandas as pd  
import matplotlib.pylab as plt
%matplotlib inline

# Load car dataset
auto = pd.read_csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv")

# create a scatterplot of weight vs "miles per gallon"
auto.plot(x='weight', y='mpg', style='bo', alpha=0.4)
plt.title("Scatterplot of weight and mpg")

# create a histogram of "miles per gallon"
plt.figure()
auto.hist('mpg', alpha=0.7)
plt.title("Histogram of mpg (miles per gallon)")


In [None]:
import numpy as np
import pandas as pd
var = pd.DataFrame({'n': np.random.normal(size=100),
                    'g': np.random.gamma(1, size=100),
                    'p': np.random.poisson(size=100)})
var.cumsum(0).plot()

In [None]:
var.cumsum(0).plot(subplots=True)

In [None]:
var.cumsum(0).plot(secondary_y='n')

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12,4))
for i,var2 in enumerate(['n','g','p']):
    var[var2].cumsum(0).plot(ax=axes[i], title=var2)
axes[0].set_ylabel('cum sum')

In [None]:
titanic = pd.read_excel('files/titanic.xls','titanic')

In [None]:
!pip install missingno

In [None]:
import missingno as msno
%matplotlib inline

# The msno.matrix nullity matrix is a data-dense display which lets you 
# quickly visually pick out patterns in data completion.
msno.matrix(titanic.sample(250))

In [None]:
# The missingno correlation heatmap lets you measure how strongly the presence of one variable positively 
# or negatively affects the presence of another:

msno.heatmap(titanic)

In [None]:
titanic.groupby('pclass').survived.sum().plot(kind='bar',alpha=0.4)

In [None]:
titanic.groupby(['sex','pclass']).survived.sum().plot(kind='barh', alpha=0.4)

In [None]:
death_counts = pd.crosstab([titanic.pclass, titanic.sex], titanic.survived.astype(bool))
death_counts.plot(kind='bar',stacked=True, color=['lightgreen','lightblue'], grid=False)

In [None]:
death_counts.div(death_counts.sum(1).astype(float), axis=0).plot(kind='barh', \
            stacked=True, color=['grey','gold'])

In [None]:
titanic.fare.hist(bins=10, grid=False, alpha=0.5), titanic.fare.hist(bins=30, alpha=0.5)

There are algorithms for determining an "optimal" number of bins, each of which varies somehow with the number of observations in the data series. It is always advisable to "explore" different numbers of bins.
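NumPy ships several such rules, which can be selected by passing a rule name as the ``bins`` argument of ``np.histogram``. The sketch below compares three of them on made-up normal data. ``sample`` and ``bin_counts`` are hypothetical names.

```python
import numpy as np

np.random.seed(0)
sample = np.random.normal(size=1000)    # hypothetical data, for illustration only

bin_counts = {}
for rule in ['sqrt', 'sturges', 'fd']:
    counts, edges = np.histogram(sample, bins=rule)
    bin_counts[rule] = len(counts)      # number of bins chosen by this rule

print(bin_counts)
```

The rules generally disagree, which is one reason to try several bin counts before settling on a histogram.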

In [None]:
import numpy as np
from scipy import stats

np.random.seed(0)
x = np.concatenate([stats.cauchy(-5, 1.8).rvs(500),
                    stats.cauchy(-4, 0.8).rvs(2000),
                    stats.cauchy(-1, 0.3).rvs(500),
                    stats.cauchy(2, 0.8).rvs(1000),
                    stats.cauchy(4, 1.5).rvs(500)])

# truncate values to a reasonable range
x = x[(x > -15) & (x < 15)]
# Histogram the result
import pylab as pl
H = pl.hist(x, density=True, alpha=0.5)   # density replaces the removed normed argument

In [None]:
# Histogram with more bins
H = pl.hist(x, bins=100, alpha=0.5, density=True)   # density replaces the removed normed argument
#pl.savefig('bayesblocks2.png')

In [None]:
def bayesian_blocks(t):
    """Bayesian Blocks Implementation

    By Jake Vanderplas.  License: BSD
    Based on algorithm outlined in http://adsabs.harvard.edu/abs/2012arXiv1207.5578S

    Parameters
    ----------
    t : ndarray, length N
        data to be histogrammed

    Returns
    -------
    bins : ndarray
        array containing the (N+1) bin edges

    Notes
    -----
    This is an incomplete implementation: it may fail for some
    datasets.  Alternate fitness functions and prior forms can
    be found in the paper listed above.
    """
    # copy and sort the array
    t = np.sort(t)
    N = t.size

    # create length-(N + 1) array of cell edges
    edges = np.concatenate([t[:1],
                            0.5 * (t[1:] + t[:-1]),
                            t[-1:]])
    block_length = t[-1] - edges

    # arrays needed for the iteration
    nn_vec = np.ones(N)
    best = np.zeros(N, dtype=float)
    last = np.zeros(N, dtype=int)

    #-----------------------------------------------------------------
    # Start with first data cell; add one cell at each iteration
    #-----------------------------------------------------------------
    for K in range(N):
        # Compute the width and count of the final bin for all possible
        # locations of the K^th changepoint
        width = block_length[:K + 1] - block_length[K + 1]
        count_vec = np.cumsum(nn_vec[:K + 1][::-1])[::-1]

        # evaluate fitness function for these possibilities
        fit_vec = count_vec * (np.log(count_vec) - np.log(width))
        fit_vec -= 4  # 4 comes from the prior on the number of changepoints
        fit_vec[1:] += best[:K]

        # find the max of the fitness: this is the K^th changepoint
        i_max = np.argmax(fit_vec)
        last[K] = i_max
        best[K] = fit_vec[i_max]
    
    #-----------------------------------------------------------------
    # Recover changepoints by iteratively peeling off the last block
    #-----------------------------------------------------------------
    change_points =  np.zeros(N, dtype=int)
    i_cp = N
    ind = N
    while True:
        i_cp -= 1
        change_points[i_cp] = ind
        if ind == 0:
            break
        ind = last[ind - 1]
    change_points = change_points[i_cp:]

    return edges[change_points]

In [None]:
# plot a standard histogram in the background, with alpha transparency
H1 = plt.hist(x, bins=200, histtype='stepfilled', alpha=0.3, density=True)
# plot an adaptive-width histogram on top
H2 = plt.hist(x, bins=bayesian_blocks(x), color='red', histtype='step', density=True)
#pl.savefig('bayesblocks3.png')

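A different way of visualizing the distribution of data is the ``boxplot``, which is a display of common quantiles: the box spans the first and third quartiles with a line at the median, and the whiskers mark the extent of the remaining data (by default in matplotlib and pandas, the most extreme points within 1.5 times the interquartile range). Points beyond the whiskers are drawn individually as outliers.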
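
To make the ingredients of a boxplot concrete, the following sketch computes them by hand for a small made-up sample. It assumes the default matplotlib convention (whiskers at the most extreme points within 1.5 times the interquartile range); `boxplot_summary` is a hypothetical helper written for this example, not a pandas or matplotlib function.

```python
import numpy as np

def boxplot_summary(values, whis=1.5):
    """Return the quartiles, whisker ends and outliers of a 1-D sample."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    # Whiskers end at the most extreme data points inside the fences
    inside = values[(values >= low_fence) & (values <= high_fence)]
    outliers = values[(values < low_fence) | (values > high_fence)]
    return {'q1': q1, 'median': median, 'q3': q3,
            'whisker_low': inside.min(), 'whisker_high': inside.max(),
            'outliers': outliers}

# Made-up sample with one obvious outlier (50)
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(summary)
```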

In [None]:
titanic.boxplot(column='fare', by='pclass', grid=False)

In [None]:
bp = titanic.boxplot(column='age', by='pclass', grid=False)
for i in [1,2,3]:
    y = titanic.age[titanic.pclass==i].dropna()
    # Add some random "jitter" to the x-axis
    x = np.random.normal(i, 0.04, size=len(y))
    plt.plot(x, y, 'r.', alpha=0.2)

### Scatterplots

In [None]:
baseball = pd.read_csv("files/baseball.csv")
baseball.head()

In [None]:
plt.scatter(baseball.ab, baseball.h)
plt.xlim(0, 700); plt.ylim(0, 200)

In [None]:
plt.scatter(baseball.ab, baseball.h, s=baseball.hr*10, alpha=0.3)
plt.xlim(0, 700); plt.ylim(0, 200)

In [None]:
plt.scatter(baseball.ab, baseball.h, c=baseball.hr, s=baseball.hr*10, alpha=0.5, cmap='hot')
plt.xlim(0, 700); plt.ylim(0, 200);

To view scatterplots of a large number of variables simultaneously, we can use the `scatter_matrix` function provided by pandas. It generates a matrix of pairwise scatterplots, optionally with histograms or kernel density estimates on the diagonal.

In [None]:
_ = pd.plotting.scatter_matrix(baseball.loc[:,'r':'sb'], figsize=(12,8), diagonal='kde')

### Exercise

The MovieLens 1M database (http://www.grouplens.org/node/73) stores 1,000,209 ratings of 3,900 films that were compiled in 2000 from 6,040 anonymous users of the online MovieLens recommender (http://www.movielens.org/).

We can read the database:

In [None]:
import pandas as pd
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('files/ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('files/ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('files/ml-1m/movies.dat', sep='::', header=None, names=mnames, engine='python')

+ What is the mean score of each user?

In [None]:
# Your code here

+ Write a function that given a user returns the movie with highest score.

In [None]:
# Your code here

Let's build a very basic movie recommender! 

First, we will define training and testing datasets.

In [None]:
import numpy as np

def assign_to_set(df):
    sampled_ids = np.random.choice(df.index,
                                   size=np.int64(np.ceil(df.index.size * 0.2)),
                                   replace=False)
    df.loc[sampled_ids, 'for_testing'] = True  # `.ix` was removed from pandas
    return df
data = pd.merge(pd.merge(ratings, users), movies)
data['for_testing'] = False
grouped = data.groupby('user_id', group_keys=False).apply(assign_to_set)
movielens_train = data[grouped.for_testing == False]
movielens_test = data[grouped.for_testing == True]
print(movielens_train.shape)
print(movielens_test.shape)

``evaluate`` measures the error of the recommender system with the root-mean-square error (RMSE) between predicted and true ratings; lower is better:

In [None]:
def compute_rmse(y_pred, y_true):
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

def evaluate(estimate,test=movielens_test):
    ids_to_estimate = zip(test['user_id'], test['movie_id'])
    estimated = np.array([estimate(u,i) for (u,i) in ids_to_estimate])
    real = test.rating.values
    return compute_rmse(estimated, real)
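
As a quick check of the metric, RMSE is the square root of the mean squared difference between predicted and true ratings. The self-contained sketch below repeats the one-line formula on three made-up ratings; the numbers are illustrative and are not taken from MovieLens.

```python
import numpy as np

def compute_rmse(y_pred, y_true):
    # Square the errors, average them, then take the square root
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

y_pred = np.array([3.0, 4.0, 5.0])   # made-up predictions
y_true = np.array([3.0, 5.0, 3.0])   # made-up true ratings

# The errors are 0, -1 and 2, so the mean squared error is (0 + 1 + 4) / 3
rmse = compute_rmse(y_pred, y_true)
print(rmse)
```

A perfect recommender would score 0, and larger errors are penalized more heavily than small ones because they are squared before averaging.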

+ Write a function that, given a user, scores any film with the mean rating of that user.

In [None]:
def rec1(user_id, item_id,train=movielens_train):
    # Put your recommender here
    return 1

print('Error: %s' % evaluate(rec1))

# Matplotlib

Matplotlib is a plotting library. In this section we give a brief introduction to the `matplotlib.pyplot` module, which provides a plotting system similar to that of MATLAB.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

The most important function in `matplotlib` is `plot`, which allows you to plot 2D data. Here is a simple example:

In [None]:
# Compute the x and y coordinates for points on a sine curve
x = np.arange(0, 3 * np.pi, 0.1)
y = np.sin(x)

# Plot the points using matplotlib
plt.plot(x, y)

With just a little bit of extra work we can easily plot multiple lines at once, and add a title, legend, and axis labels:

In [None]:
y_cos = np.cos(x)
y_sin = np.sin(x)
# Plot the points using matplotlib
plt.plot(x, y_sin)
plt.plot(x, y_cos)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.title('Sine and Cosine')
plt.legend(['Sine', 'Cosine'])

### Subplots 

You can plot different things in the same figure using the subplot function. Here is an example:

In [None]:
# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.subplot(2, 1, 1)

# Make the first plot
plt.plot(x, y_sin)
plt.title('Sine')

# Set the second subplot as active, and make the second plot.
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine')

# Show the figure.
plt.show()
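
An alternative that many find easier to manage is the object-oriented interface: `plt.subplots` returns the figure together with an array of axes, so each panel is addressed explicitly instead of through the "active subplot" state. The following is a minimal sketch of the same two-panel figure; sharing the x axis and calling `tight_layout` are choices made for this example only.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 3 * np.pi, 0.1)

# One figure with a 2-row, 1-column grid of axes sharing the x axis
fig, axes = plt.subplots(2, 1, sharex=True)

axes[0].plot(x, np.sin(x))
axes[0].set_title('Sine')

axes[1].plot(x, np.cos(x))
axes[1].set_title('Cosine')

fig.tight_layout()  # avoid overlapping titles and tick labels
```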

In [None]:
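# The seaborn styles were renamed with a 'seaborn-v0_8-' prefix in Matplotlib 3.6
style = 'seaborn-whitegrid'
if style not in plt.style.available:
    style = 'seaborn-v0_8-whitegrid'
plt.style.use(style)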
import numpy as np

x = np.linspace(0, 10, 30)
y = np.sin(x)

plt.plot(x, y, 'o', color='black');

In [None]:
rng = np.random.RandomState(0)
for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
    plt.plot(rng.rand(5), rng.rand(5), marker,
             label="marker='{0}'".format(marker))
plt.legend(numpoints=1)
plt.xlim(0, 1.8);

In [None]:

rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)

plt.scatter(x, y, c=colors, s=sizes, alpha=0.3,
            cmap='viridis')
plt.colorbar();  # show color scale

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
features = iris.data.T

plt.scatter(features[0], features[1], alpha=0.2,
            s=100*features[3], c=iris.target, cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1]);

In [None]:

x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)

plt.errorbar(x, y, yerr=dy, fmt='.k');

In [None]:
plt.errorbar(x, y, yerr=dy, fmt='o', color='black',
             ecolor='lightgray', elinewidth=3, capsize=0);

You can read much more about `matplotlib` in the [documentation](http://matplotlib.org/api/pyplot_api.html).

# Sklearn

Supervised learning:

+ Linear models (Ridge, Lasso, Elastic Net, ...)
+ Support Vector Machines
+ Tree-based methods (Random Forests, Bagging, GBRT, ...)
+ Nearest neighbors
+ Neural networks
+ Gaussian Processes
+ Feature selection

Unsupervised learning:
+ Clustering (KMeans, Ward, ...)
+ Matrix decomposition (PCA, ICA, ...)
+ Density estimation
+ Outlier detection

Model selection and evaluation:
+ Cross-validation
+ Grid-search
+ Lots of metrics

... and many more! (See [scikit-learn.org](http://scikit-learn.org/))

In [None]:
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()

In [None]:
%matplotlib inline
import seaborn as sns; sns.set()
sns.pairplot(iris, hue='species', height=1.5);  # `size` was renamed to `height`

In [None]:
X_iris = iris.drop('species', axis=1)
X_iris.shape

In [None]:
y_iris = iris['species']
y_iris.shape

In [None]:
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
plt.scatter(x, y);

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model

In [None]:
X = x[:, np.newaxis]
X.shape

In [None]:
model.fit(X, y)

In [None]:
model.coef_

In [None]:
model.intercept_

In [None]:
xfit = np.linspace(-1, 11)

In [None]:
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)

In [None]:
plt.scatter(x, y)
plt.plot(xfit, yfit);

In [None]:
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris,random_state=1)

In [None]:
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB()                       # 2. instantiate model
model.fit(Xtrain, ytrain)                  # 3. fit model to data
y_model = model.predict(Xtest)             # 4. predict on new data

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)
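
A single train/test split gives only one estimate of the accuracy, and it depends on which rows happened to land in the test set. Cross-validation, listed under model selection above, averages over several splits. Below is a minimal sketch with `cross_val_score`, assuming scikit-learn 0.18 or later (where it lives in `sklearn.model_selection`). It loads iris from scikit-learn itself rather than from seaborn so that the snippet is self-contained, and the names `X_cv` and `y_cv` are chosen only to avoid overwriting other variables.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X_cv, y_cv = load_iris(return_X_y=True)

# 5-fold cross-validation: fit on 4/5 of the data, score on the remaining 1/5,
# and repeat so that every sample is used for testing exactly once
scores = cross_val_score(GaussianNB(), X_cv, y_cv, cv=5)
print(scores)
print(scores.mean())
```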

In [None]:
from sklearn.mixture import GaussianMixture      # 1. Choose the model class (formerly GMM)
model = GaussianMixture(n_components=3,
                        covariance_type='full')  # 2. Instantiate the model with hyperparameters
model.fit(X_iris)                                # 3. Fit to data. Notice y is not specified!
y_gmm = model.predict(X_iris)                    # 4. Determine cluster labels

In [None]:
from sklearn.decomposition import PCA  # 1. Choose the model class
model = PCA(n_components=2)            # 2. Instantiate the model with hyperparameters
model.fit(X_iris)                      # 3. Fit to data. Notice y is not specified!
X_2D = model.transform(X_iris)         # 4. Transform the data to two dimensions

iris['PCA1'] = X_2D[:, 0]
iris['PCA2'] = X_2D[:, 1]
sns.lmplot(x="PCA1", y="PCA2", hue='species', data=iris, fit_reg=False);

iris['cluster'] = y_gmm
sns.lmplot(x="PCA1", y="PCA2", data=iris, hue='species', col='cluster', fit_reg=False);

In [None]:
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4,
                  random_state=0, cluster_std=1.0)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow');

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier().fit(X, y)

In [None]:
def visualize_classifier(model, X, y, ax=None, cmap='rainbow'):
    ax = ax or plt.gca()
    
    # Plot the training points
    ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=cmap,
               clim=(y.min(), y.max()), zorder=3)
    ax.axis('tight')
    ax.axis('off')
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    # fit the estimator
    model.fit(X, y)
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    # Create a color plot with the results
    n_classes = len(np.unique(y))
    contours = ax.contourf(xx, yy, Z, alpha=0.3,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap=cmap, clim=(y.min(), y.max()),
                           zorder=1)

    ax.set(xlim=xlim, ylim=ylim)

visualize_classifier(DecisionTreeClassifier(), X, y)


In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=0)
visualize_classifier(model, X, y);

In [None]:
rng = np.random.RandomState(42)
x = 10 * rng.rand(200)

def model(x, sigma=0.3):
    fast_oscillation = np.sin(5 * x)
    slow_oscillation = np.sin(0.5 * x)
    noise = sigma * rng.randn(len(x))

    return slow_oscillation + fast_oscillation + noise

y = model(x)
plt.errorbar(x, y, 0.3, fmt='o');

In [None]:
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators=200)
forest.fit(x[:, None], y)

xfit = np.linspace(0, 10, 1000)
yfit = forest.predict(xfit[:, None])
ytrue = model(xfit, sigma=0)

plt.errorbar(x, y, 0.3, fmt='o', alpha=0.5)
plt.plot(xfit, yfit, '-r');
plt.plot(xfit, ytrue, '-k', alpha=0.5);