#  NumPy
**NumPy is the fundamental package for scientific computing with Python.**

**Installation instructions:** http://www.scipy.org/scipylib/download.html

In [None]:
import numpy as np

## Arrays
**A NumPy array is a grid of values, all of the same type, indexed by a tuple of nonnegative integers.**

In [None]:
arr = np.array([1, 2, 3])   # Create a one-dimensional array from a list
a = np.zeros((2, 2))        # Create an array of all zeros
b = np.ones((1, 2))         # Create an array of all ones
c = np.full((2,2), 7)       # Create a constant array
d = np.eye(2)               # Create 2x2 identity matrix
e = np.random.random((2,2)) # Create an array filled with random values
my_array = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

In [None]:
print(arr)
print("")
print(a)
print("")
print(b)
print("")
print(c)
print("")
print(d)
print("")
print(e)
print("")
print(my_array)

## Array indexing

**Normal indexing**

In [None]:
my_array = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(my_array[0])     # first row, same as for Python lists
print(my_array[-1])    # last row, same as for Python lists
print(my_array[0][1])  # chained indexing, same as for nested Python lists
print(my_array[0, 1])  # tuple indexing, specific to NumPy arrays

**Integer array indexing**

In [None]:
my_array = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

my_array[[0, 1, 2], [0, 0, 0]]

In [None]:
# Equivalent to this
np.array([my_array[0, 0], my_array[1, 0], my_array[2, 0]])
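
Integer array indexing is also handy for picking or updating one element per row. The following is a minimal sketch; the `cols` array (one column index per row) and the name `picked` are hypothetical and only serve the illustration.

```python
import numpy as np

my_array = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
cols = np.array([0, 2, 3])  # hypothetical: one column index for each row

# Pair row i with cols[i]: selects elements (0, 0), (1, 2) and (2, 3)
picked = my_array[np.arange(3), cols]

# The same index expression can be used to update only those elements
my_array[np.arange(3), cols] += 100
```

Unlike a slice, the result of integer array indexing (`picked` here) is a copy, so the in-place update afterwards does not change it.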

**Boolean array indexing**

In [None]:
my_array = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

bool_idx = (my_array > 2)
print(bool_idx)

my_array[bool_idx]

In [None]:
# Or directly
my_array[my_array > 2]

**Slicing**

In [None]:
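b = my_array[:2, 1:3]  # take the first two rows and columns 1 and 2
print(my_array)
print("")
print(b)
print("")

# A slice of an array is a view into the same data,
# so modifying it will modify the original array
print(my_array[0, 1])  # prints 2
b[0, 0] = 55           # b[0, 0] is the same piece of data as my_array[0, 1]
print(my_array[0, 1])  # prints 55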
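
If an independent array is needed instead of a view, one option is to call `.copy()` on the slice. The sketch below contrasts the two; the names `view` and `independent` are hypothetical.

```python
import numpy as np

my_array = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

view = my_array[:2, 1:3]                # a view: shares memory with my_array
independent = my_array[:2, 1:3].copy()  # a copy: owns its own data

independent[0, 0] = 99  # does not touch my_array
view[0, 0] = 55         # writes through to my_array[0, 1]

# np.shares_memory reports whether two arrays overlap in memory
view_is_shared = np.shares_memory(my_array, view)
copy_is_shared = np.shares_memory(my_array, independent)
```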

## Datatypes

In [None]:
x = np.array([1, 2])                  # Let NumPy choose the datatype
print(x.dtype)

x = np.array([1.0, 2.0])              # Let NumPy choose the datatype
print(x.dtype)

x = np.array([1, 2], dtype=np.int64)  # Force a particular datatype
print(x.dtype)
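
An existing array can also be converted to another datatype with `astype`, which returns a new array. A small sketch follows, with hypothetical values chosen to show that a float-to-integer conversion truncates toward zero rather than rounding.

```python
import numpy as np

x = np.array([1.7, 2.2, -0.5])     # float64 by default

as_int = x.astype(np.int64)        # fractional part is dropped
as_float32 = x.astype(np.float32)  # lower precision, half the memory per element
```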

## Array math

In [None]:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Addition
print("Add:")
print(x + y)     # np.add(x, y)

# Difference
print("Subtract:")
print(x - y)     # np.subtract(x, y)

# Elementwise product
print("Multiply:")
print(x * y)     # np.multiply(x, y)

# Matrix product
print("Matrix product:")
print(x.dot(y))  # np.dot(x, y)

# Division
print("Divide:")
print(x / y)     # np.divide(x, y)

# Square root
print("Square root:")
print(np.sqrt(x))

# Sum
print("Sum:")
print(np.sum(x))          # Compute sum of all elements
print(np.sum(x, axis=0))  # Compute sum of each column
print(np.sum(x, axis=1))  # Compute sum of each row

## Broadcasting
**Broadcasting is a powerful mechanism that allows numpy to work with arrays of different shapes (dimensions) when performing arithmetic operations. Frequently we have a smaller array and a larger array, and we want to use the smaller array multiple times to perform some operation on the larger array.**

In [None]:
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = np.empty_like(x)   # Create an empty matrix with the same shape as x

# Add the vector v to each row of the matrix x with an explicit loop
for i in range(4):
    y[i, :] = x[i, :] + v
    
print(y)

This works; however, when the matrix x is very large, an explicit loop in Python can be slow. Note that adding the vector v to each row of the matrix x is equivalent to forming a matrix vv by stacking multiple copies of v vertically, then performing elementwise summation of x and vv.
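
The stacking approach described above could be sketched as follows, assuming `np.tile` is used to build `vv`.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
v = np.array([1, 0, 1])

vv = np.tile(v, (4, 1))  # stack 4 copies of v on top of each other, shape (4, 3)
y = x + vv               # elementwise sum of two arrays of the same shape
```

This avoids the Python loop but still materializes the full matrix `vv` in memory.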

In [None]:
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = x + v  # Add v to each row of x using broadcasting
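
Broadcasting compares shapes from the trailing dimension backwards, so a vector of length 3 lines up with the columns of a `(4, 3)` matrix. To add a value per row instead, the vector can be reshaped into a column first. The sketch below illustrates both cases; the vector `w` is a hypothetical addition to the example.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])  # shape (4, 3)
v = np.array([1, 0, 1])                                        # shape (3,)
w = np.array([10, 20, 30, 40])                                 # shape (4,), hypothetical

row_result = x + v                # (4, 3) + (3,): v is added to every row
col_result = x + w.reshape(4, 1)  # (4, 3) + (4, 1): w[i] is added to row i

# Note: x + w would raise a ValueError, because the trailing
# dimensions (3 and 4) differ and neither of them is 1.
```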

## Functions

* **```add(x1, x2[, out])```** - Add arguments element-wise.
* **```subtract(x1, x2[, out])```** - Subtract arguments, element-wise.
* **```multiply(x1, x2[, out])```** - Multiply arguments element-wise.
* **```divide(x1, x2[, out])```** - Divide arguments element-wise.
* **```logaddexp(x1, x2[, out])```** - Logarithm of the sum of exponentiations of the inputs.
* **```logaddexp2(x1, x2[, out])```** - Logarithm of the sum of exponentiations of the inputs in base-2.
* **```true_divide(x1, x2[, out])```** - Returns a true division of the inputs, element-wise.
* **```floor_divide(x1, x2[, out])```** - Return the largest integer smaller or equal to the division of the inputs.
* **```negative(x[, out])```** - Numerical negative, element-wise.
* **```power(x1, x2[, out])```** - First array elements raised to powers from second array, element-wise.
* **```remainder(x1, x2[, out])```** - Return element-wise remainder of division; the result has the same sign as the divisor x2.
* **```mod(x1, x2[, out])```** - Alias of remainder.
* **```fmod(x1, x2[, out])```** - Return the element-wise remainder of division; the result has the same sign as the dividend x1.
* **```absolute(x[, out])```** - Calculate the absolute value element-wise.
* **```rint(x[, out])```** - Round elements of the array to the nearest integer.
* **```sign(x[, out])```** - Returns an element-wise indication of the sign of a number.
* **```conj(x[, out])```** - Return the complex conjugate, element-wise.
* **```exp(x[, out])```** - Calculate the exponential of all elements in the input array.
* **```exp2(x[, out])```** - Calculate 2**p for all p in the input array.
* **```log(x[, out])```** - Natural logarithm, element-wise.
* **```log2(x[, out])```** - Base-2 logarithm of x.
* **```log10(x[, out])```** - Return the base 10 logarithm of the input array, element-wise.
* **```expm1(x[, out])```** - Calculate exp(x) - 1 for all elements in the array.
* **```log1p(x[, out])```** - Return the natural logarithm of one plus the input array, element-wise.
* **```sqrt(x[, out])```** - Return the positive square-root of an array, element-wise.
* **```square(x[, out])```** - Return the element-wise square of the input.
* **```reciprocal(x[, out])```** - Return the reciprocal of the argument, element-wise.
* **```ones_like(a[, dtype, order, subok])```** - Return an array of ones with the same shape and type as a given array.
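
A short sketch of how a few of the functions listed above behave, including the optional `out` argument. The input values are hypothetical and were chosen so that the difference between `remainder` and `fmod` on negative inputs is visible.

```python
import numpy as np

x1 = np.array([7.0, -7.0, 9.0])
x2 = np.array([2.0, 2.0, 4.0])

quotient = np.floor_divide(x1, x2)  # rounds down: [3., -4., 2.]
rem = np.remainder(x1, x2)          # sign follows the divisor: [1., 1., 1.]
frem = np.fmod(x1, x2)              # sign follows the dividend: [1., -1., 1.]

# With out=..., the result is written into an existing array
# instead of allocating a new one.
result = np.empty_like(x1)
np.add(x1, x2, out=result)
```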

# Matplotlib
**Matplotlib provides a MATLAB-like plotting framework.**

**Installation instructions:** http://matplotlib.org/users/installing.html 

In [212]:
import matplotlib.pyplot as plt

## Plotting

In [213]:
plt.plot([1, 2, 3, 4])
plt.ylabel('some numbers')
plt.show()

**Plot x versus y**

In [214]:
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])

# axis limits: [xmin, xmax, ymin, ymax]
plt.axis([0, 6, 0, 20])
plt.show()

**With numpy arrays**

In [215]:
import numpy as np

# evenly sampled time at 200 ms intervals
t = np.arange(0., 5., 0.2)

# red dashed, blue squares and green triangles
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.show()

## Subplots

In [216]:
def f(t):
    return np.exp(-t) * np.cos(2*np.pi*t)

t1 = np.arange(0.0, 5.0, 0.1)
t2 = np.arange(0.0, 5.0, 0.02)

plt.figure(1)
plt.subplot(211)
plt.plot(t1, f(t1), 'bo', t2, f(t2), 'k')

plt.subplot(212)
plt.plot(t2, np.cos(2*np.pi*t2), 'r--')
plt.show()

**Multiple figures**

In [217]:
plt.figure(1)                # the first figure
plt.subplot(211)             # the first subplot in the first figure
plt.plot([1, 2, 3])
plt.subplot(212)             # the second subplot in the first figure
plt.plot([4, 5, 6])


plt.figure(2)                # a second figure
plt.plot([4, 5, 6])          # creates a subplot(111) by default

plt.figure(1)                # figure 1 current; subplot(212) still current
plt.subplot(211)             # make subplot(211) in figure1 current
plt.title('Easy as 1, 2, 3') # subplot 211 title

plt.show()

**Working with text**

In [218]:
mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)

# the histogram of the data
# density=True normalizes the histogram (replaces the older normed=1 argument)
n, bins, patches = plt.hist(x, 50, density=True, facecolor='g', alpha=0.75)


plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title('Histogram of IQ')
plt.text(60, .025, r'$\mu=100,\ \sigma=15$') # TeX equation expressions
plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.show()

**Annotating text**

In [219]:
ax = plt.subplot(111)

t = np.arange(0.0, 5.0, 0.01)
s = np.cos(2*np.pi*t)
line, = plt.plot(t, s, lw=2)

plt.annotate('local max', xy=(2, 1), xytext=(3, 1.5),
            arrowprops=dict(facecolor='black', shrink=0.05),
            )

plt.ylim(-2,2)
plt.show()

# Pandas
**Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.**

**Installation instructions:** http://pandas.pydata.org/pandas-docs/stable/install.html 

In [220]:
from pandas import *  
import pandas as pd

## Dataframes
**Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure**
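
To make the description above more concrete, here is a minimal sketch of a DataFrame built from a dict of Series, followed by an addition that aligns on row and column labels. All names and values are hypothetical.

```python
import pandas as pd

# A DataFrame built from a dict of Series; the dict keys become column names
df = pd.DataFrame({
    "a": pd.Series([1, 2, 3], index=["x", "y", "z"]),
    "b": pd.Series([10, 20, 30], index=["x", "y", "z"]),
})

# A second frame that only covers some of the rows and columns
other = pd.DataFrame({"a": [100, 200]}, index=["y", "z"])

# Values are matched by label, not by position;
# labels present in only one frame produce NaN
total = df + other

column_a = df["a"]  # selecting a column returns a Series
```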

### Create Data

In [221]:
# The initial set of baby names and birth counts
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
BabyDataSet = list(zip(names, births))  # list() so the result is also a list on Python 3
BabyDataSet

[('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]

**Create a DataFrame**

In [222]:
df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
df

     Names  Births
0      Bob     968
1  Jessica     155
2     Mary      77
3     John     578
4      Mel     973


**Output data to CSV**

In [223]:
df.to_csv('files/births1889_output.csv', index=False)

### Get Data

**Read data from CSV**

In [224]:
df = pd.read_csv("files/births1889_output.csv")
df

     Names  Births
0      Bob     968
1  Jessica     155
2     Mary      77
3     John     578
4      Mel     973


**Read data from JSON**

In [225]:
df = pd.read_json('files/simple.json')
df

   Age   Name       Role
0   23    Bob  Architect
1   22  Ariel     Intern


**Read data from dict**

In [226]:
data = {
    "Bob": 968,
    "Jessica": 155,
    "Mary": 77,
    "John": 578,
    "Mel": 973
}
df = DataFrame(list(data.items()), columns=["Names", "Births"])
df

     Names  Births
0      Mel     973
1      Bob     968
2     John     578
3  Jessica     155
4     Mary      77


### Indexing

In [227]:
experimentDF =  read_csv("files/parasite_data.csv", na_values=[" "], skiprows=1,names=['Virulence', 'Replicate', 'ShannonDiversity'])
experimentDF

   Virulence  Replicate  ShannonDiversity
0        0.5          1          0.059262
1        0.5          2          1.093600
2        0.5          3          1.139390
3        0.5          4          0.547651
4        0.5          5          0.065928
5        0.5          6          1.344330
6        0.5          7          1.680480
7        0.5          8          0.000000
8        0.5          9          2.047680
9        0.5         10          0.000000


**Index by [column]**

In [228]:
experimentDF["Virulence"]

0      0.5
1      0.5
2      0.5
3      0.5
4      0.5
5      0.5
6      0.5
7      0.5
8      0.5
9      0.5
10     0.5
11     0.5
12     0.5
13     0.5
14     0.5
15     0.5
16     0.5
17     0.5
18     0.5
19     0.5
20     0.5
21     0.5
22     0.5
23     0.5
24     0.5
25     0.5
26     0.5
27     0.5
28     0.5
29     0.5
      ... 
320    NaN
321    NaN
322    NaN
323    NaN
324    NaN
325    NaN
326    NaN
327    NaN
328    NaN
329    NaN
330    NaN
331    NaN
332    NaN
333    NaN
334    NaN
335    NaN
336    NaN
337    NaN
338    NaN
339    NaN
340    NaN
341    NaN
342    NaN
343    NaN
344    NaN
345    NaN
346    NaN
347    NaN
348    NaN
349    NaN
Name: Virulence, dtype: float64

**Index by [column][row]**

In [229]:
# show the row with index label 12 (the 13th row) in the ShannonDiversity column
experimentDF["ShannonDiversity"][12]

1.5898099999999999
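
The chained form `[column][row]` looks the row up by its index label. A more explicit alternative is `.loc` for label-based access and `.iloc` for position-based access. The sketch below uses a small hypothetical frame (not the parasite data set) whose index labels differ from the row positions.

```python
import pandas as pd

# Hypothetical data; the index labels 10, 11, 12 are not the positions 0, 1, 2
df = pd.DataFrame(
    {"Virulence": [0.5, 0.5, 0.6], "ShannonDiversity": [0.059262, 1.0936, 2.04768]},
    index=[10, 11, 12],
)

chained = df["ShannonDiversity"][12]           # label lookup on the column Series
by_label = df.loc[12, "ShannonDiversity"]      # row label and column label in one step
by_position = df["ShannonDiversity"].iloc[2]   # third row, regardless of its label
```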

**Index by [boolean expression]**

In [230]:
# show all entries in the ShannonDiversity column > 2.0
print(experimentDF[experimentDF["ShannonDiversity"] > 2.0])

     Virulence  Replicate  ShannonDiversity
8          0.5          9           2.04768
89         0.6         40           2.01066
92         0.6         43           2.90081
96         0.6         47           2.02915
105        0.7          6           2.23427
117        0.7         18           2.14296
127        0.7         28           2.23599
129        0.7         30           2.48422
133        0.7         34           2.18506
134        0.7         35           2.42177
139        0.7         40           2.25737
142        0.7         43           2.07258
148        0.7         49           2.38326
151        0.8          2           2.07970
153        0.8          4           2.38474
163        0.8         14           2.03252
165        0.8         16           2.38415
170        0.8         21           2.02297
172        0.8         23           2.13882
173        0.8         24           2.53339
182        0.8         33           2.17865
196        0.8         47       

**NA/NaN values**

In [231]:
print("Mean virulence across all treatments:")
print(experimentDF["Virulence"].mean())  # NaN entries are skipped by default

Mean virulence across all treatments:
0.75


In [232]:
# Drop NA
print(experimentDF["Virulence"].dropna())

0      0.5
1      0.5
2      0.5
3      0.5
4      0.5
5      0.5
6      0.5
7      0.5
8      0.5
9      0.5
10     0.5
11     0.5
12     0.5
13     0.5
14     0.5
15     0.5
16     0.5
17     0.5
18     0.5
19     0.5
20     0.5
21     0.5
22     0.5
23     0.5
24     0.5
25     0.5
26     0.5
27     0.5
28     0.5
29     0.5
      ... 
270    1.0
271    1.0
272    1.0
273    1.0
274    1.0
275    1.0
276    1.0
277    1.0
278    1.0
279    1.0
280    1.0
281    1.0
282    1.0
283    1.0
284    1.0
285    1.0
286    1.0
287    1.0
288    1.0
289    1.0
290    1.0
291    1.0
292    1.0
293    1.0
294    1.0
295    1.0
296    1.0
297    1.0
298    1.0
299    1.0
Name: Virulence, dtype: float64


In [233]:
# Fill NA
print(experimentDF.fillna(0.0)["Virulence"])

0      0.5
1      0.5
2      0.5
3      0.5
4      0.5
5      0.5
6      0.5
7      0.5
8      0.5
9      0.5
10     0.5
11     0.5
12     0.5
13     0.5
14     0.5
15     0.5
16     0.5
17     0.5
18     0.5
19     0.5
20     0.5
21     0.5
22     0.5
23     0.5
24     0.5
25     0.5
26     0.5
27     0.5
28     0.5
29     0.5
      ... 
320    0.0
321    0.0
322    0.0
323    0.0
324    0.0
325    0.0
326    0.0
327    0.0
328    0.0
329    0.0
330    0.0
331    0.0
332    0.0
333    0.0
334    0.0
335    0.0
336    0.0
337    0.0
338    0.0
339    0.0
340    0.0
341    0.0
342    0.0
343    0.0
344    0.0
345    0.0
346    0.0
347    0.0
348    0.0
349    0.0
Name: Virulence, dtype: float64
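
How missing values are treated changes the statistics computed from a column. The following sketch uses a small hypothetical Series (not the parasite data set) to compare the default behaviour of `mean` with the `skipna=False` option and with filling NaN by zero.

```python
import numpy as np
import pandas as pd

virulence = pd.Series([0.5, 1.0, np.nan, np.nan])  # hypothetical values

n_missing = virulence.isnull().sum()            # number of NaN entries
mean_default = virulence.mean()                 # NaN entries are skipped
mean_strict = virulence.mean(skipna=False)      # NaN propagates into the result
mean_filled = virulence.fillna(0.0).mean()      # zeros now count towards the mean
n_after_drop = len(virulence.dropna())          # rows left after dropping NaN
```

Here the default mean is 0.75, while filling with zeros gives 0.375, so `fillna(0.0)` is only appropriate when a missing value really means zero.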


### Selecting

### Sorting

In [234]:
experimentDF_sorted = experimentDF.sort_values(by="ShannonDiversity", ascending=False)
experimentDF_sorted.head(1)

    Virulence  Replicate  ShannonDiversity
92        0.6         43           2.90081


In [235]:
experimentDF['ShannonDiversity'].max()

2.9008099999999999
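
When only the largest entries are of interest, a full sort is not strictly needed. As a sketch on a small hypothetical frame (not the parasite data set), `idxmax` returns the index label of the maximum and `nlargest` returns the top rows.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "Virulence": [0.5, 0.6, 0.7],
    "ShannonDiversity": [2.04768, 2.90081, 2.23427],
})

top_label = df["ShannonDiversity"].idxmax()       # index label of the largest value
top_row = df.loc[top_label]                       # the full row at that label
largest_two = df.nlargest(2, "ShannonDiversity")  # top two rows, largest first
```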

### Plotting

In [257]:
# Create graph
experimentDF['ShannonDiversity'].plot(title="ShannonDiversity")

# Maximum value in the data set
MaxShannonDiversity = experimentDF['ShannonDiversity'].max()

# Virulence associated with the maximum value
MaxVirulence = experimentDF['Virulence'][experimentDF['ShannonDiversity'] == MaxShannonDiversity].values

# Text to display on graph
Text = str(MaxShannonDiversity) + " - " + str(MaxVirulence)

# Add text to graph
plt.annotate(Text, xy=(1, MaxShannonDiversity), xytext=(8, 0), 
                 xycoords=('axes fraction', 'data'), textcoords='offset points')


plt.show()

experimentDF[experimentDF['ShannonDiversity'] == MaxShannonDiversity]

    Virulence  Replicate  ShannonDiversity
92        0.6         43           2.90081


## Statistical analysis with pandas

# Exercise: classification of last 12-month claim data records

# Exercise: parse Apache log with apache-log-parser

# Exercise: OCR

# (Vectorize images)