#  NumPy
**NumPy is the fundamental package for scientific computing with Python.**

* **Module required**: ```numpy```
* **Installation**: ```pip install numpy``` - http://www.scipy.org/scipylib/download.html

In [None]:
import numpy as np

## Arrays
**A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers.**

In [None]:
a = np.array([1, 2, 3])     # Create a rank 1 array
b = np.zeros((2, 2))        # Create an array of all zeros
c = np.ones((1, 2))         # Create an array of all ones
d = np.full((2,2), 7)       # Create a constant array
e = np.eye(2)               # Create 2x2 identity matrix
f = np.random.random((2,2)) # Create an array filled with random values

my_array = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

In [None]:
print("a:\n", a, "\n")
print("b:\n", b, "\n")
print("c:\n", c, "\n")
print("d:\n", d, "\n")
print("e:\n", e, "\n")
print("my_array:\n", my_array, "\n")
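
A short sketch of the attributes that describe this "grid of values": its number of dimensions, its shape, and the single element type shared by all values. The variable names `grid` and `mixed` are illustrative and not part of the cells above.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

print(grid.ndim)   # number of dimensions: 2
print(grid.shape)  # size along each dimension: (3, 4)
print(grid.size)   # total number of elements: 12
print(grid.dtype)  # element type shared by every value

# Mixing integers and floats promotes every element to a float type
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
```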

## Array indexing

**Normal indexing**

In [None]:
my_array = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(my_array[0])    # same as for Python lists
print(my_array[-1])   # same as for Python lists
print(my_array[0][1]) # same as for Python lists
print(my_array[0, 1]) # new in Numpy arrays (multi-indexing)

**Integer array indexing**

In [None]:
my_array = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

my_array[[0, 1, 2], [0, 0, 0]]

In [None]:
# Equivalent to this
np.array([my_array[0, 0], my_array[1, 0], my_array[2, 0]])
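
Integer array indexing is not limited to a single column. As a minimal sketch (the names `grid`, `rows`, `cols` and the chosen column indices are arbitrary), the example below picks one element from each row and shows that the result is a copy rather than a view.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

rows = np.arange(3)          # row indices 0, 1, 2
cols = np.array([0, 2, 3])   # one column index per row (arbitrary choice)

picked = grid[rows, cols]    # elements (0, 0), (1, 2) and (2, 3)
print(picked)                # [ 1  7 12]

# Integer array indexing returns a copy, so the original is unchanged
picked[0] = 100
print(grid[0, 0])            # 1
```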

**Boolean array indexing**

In [None]:
my_array = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

bool_idx = (my_array > 2)
print(bool_idx)

my_array[bool_idx]

In [None]:
# Or directly
my_array[my_array > 2]

In [None]:
# Multiple boolean conditions
my_array[(my_array > 2) & (my_array % 2 == 0) & (my_array <= 10)]

**Slicing**

In [None]:
b = my_array[:2, 1:3] # take the first two rows and columns 1 and 2
print("my_array:\n", my_array, "\n")
print("b:\n", b, "\n")

# A slice of an array is a view into the same data,
# so modifying it will modify the original array
print(my_array[0, 1])
b[0, 0] = 54
print(my_array[0, 1])
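
When the write-through behaviour of a view is not wanted, an explicit copy avoids it. The sketch below (variable names `grid`, `view` and `copy` are illustrative) contrasts the two and uses `np.shares_memory` to check which one is tied to the original array.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

view = grid[:2, 1:3]           # a view: shares memory with grid
copy = grid[:2, 1:3].copy()    # an independent copy

copy[0, 0] = 99                # does not touch grid
view[0, 0] = 54                # writes through to grid[0, 1]

print(grid[0, 1])                      # 54
print(np.shares_memory(grid, view))    # True
print(np.shares_memory(grid, copy))    # False
```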

## Datatypes

In [None]:
x = np.array([1, 2])                  # NumPy picks an integer type
print(x.dtype)

x = np.array([1.0, 2.0])              # NumPy picks a float type
print(x.dtype)

x = np.array([1, 2], dtype=np.int64)  # Force a particular datatype
print(x.dtype)
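
An existing array can also be converted to another datatype with `astype`. A small sketch, with made-up values, showing that a float-to-integer conversion truncates toward zero instead of rounding:

```python
import numpy as np

floats = np.array([1.7, 2.2, -0.5])
ints = floats.astype(np.int64)   # truncates toward zero, does not round
print(ints)                      # [1 2 0]

small = np.array([1, 2, 3], dtype=np.int8)
print(small.itemsize)            # bytes used per element: 1
```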

## Array math

In [None]:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Addition
print("Add:")
print(x + y)    # np.add(x, y)

# Difference
print("Subtract:")
print(x - y)    # np.subtract(x, y)

# Element-wise product
print("Multiply: ")
print(x * y)    # np.multiply(x, y)

# Matrix product
print("Matrix product:")
print(x.dot(y)) # np.dot(x, y)

# Division
print("Divide:")
print(x / y)    # np.divide(x, y)

# Square root
print("Square root:")
print(np.sqrt(x))

# Sum
print("Sum: ")
print(np.sum(x))
print(np.sum(x, axis=0)) # Compute sum of each column
print(np.sum(x, axis=1)) # Compute sum of each row
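
Because `*` is element-wise, it is easy to confuse with the matrix product. The sketch below reuses the same two matrices and, assuming Python 3.5 or later, also uses the `@` operator as an alternative spelling of `x.dot(y)`.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]], dtype=np.float64)
y = np.array([[5, 6], [7, 8]], dtype=np.float64)

elementwise = x * y   # multiplies matching entries
matrix = x @ y        # matrix product, same result as x.dot(y)

print(elementwise)    # [[ 5. 12.] [21. 32.]]
print(matrix)         # [[19. 22.] [43. 50.]]
```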

## Broadcasting
**Broadcasting is a powerful mechanism that allows numpy to work with arrays of different shapes (dimensions) when performing arithmetic operations. Frequently we have a smaller array and a larger array, and we want to use the smaller array multiple times to perform some operation on the larger array.**

In [None]:
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = np.empty_like(x)   # Create an empty matrix with the same shape as x

# Add the vector v to each row of the matrix x with an explicit loop
for i in range(4):
    y[i, :] = x[i, :] + v
print(y)

This works; however, when the matrix x is very large, running an explicit loop in Python can be slow. Note that adding the vector v to each row of the matrix x is equivalent to forming a matrix vv by stacking multiple copies of v vertically, then performing elementwise summation of x and vv.

In [None]:
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = x + v  # Add v to each row of x using broadcasting
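
To make the equivalence described above concrete, the sketch below builds the stacked matrix `vv` explicitly with `np.tile` and checks that it gives the same result as broadcasting. The column vector `w` is an extra illustration that is not in the original cells: to add one value per row, the vector needs a trailing axis of length 1.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
v = np.array([1, 0, 1])

vv = np.tile(v, (4, 1))      # stack 4 copies of v vertically, shape (4, 3)
stacked = x + vv             # element-wise sum with the explicit copies
broadcast = x + v            # same result without building vv
print(np.array_equal(stacked, broadcast))   # True

# One value per row: reshape w to (4, 1) so it broadcasts across columns
w = np.array([10, 20, 30, 40])
per_row = x + w[:, np.newaxis]
print(per_row[1])            # [24 25 26]
```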

## Functions

* **```add(x1, x2[, out])```** - Add arguments element-wise.
* **```subtract(x1, x2[, out])```** - Subtract arguments, element-wise.
* **```multiply(x1, x2[, out])```** - Multiply arguments element-wise.
* **```divide(x1, x2[, out])```** - Divide arguments element-wise.
* **```logaddexp(x1, x2[, out])```** - Logarithm of the sum of exponentiations of the inputs.
* **```logaddexp2(x1, x2[, out])```** - Logarithm of the sum of exponentiations of the inputs in base-2.
* **```true_divide(x1, x2[, out])```** - Returns a true division of the inputs, element-wise.
* **```floor_divide(x1, x2[, out])```** - Return the largest integer smaller than or equal to the division of the inputs.
* **```negative(x[, out])```** - Numerical negative, element-wise.
* **```power(x1, x2[, out])```** - First array elements raised to powers from second array, element-wise.
* **```remainder(x1, x2[, out])```** - Return element-wise remainder of division.
* **```mod(x1, x2[, out])```** - Return element-wise remainder of division.
* **```fmod(x1, x2[, out])```** - Return the element-wise remainder of division.
* **```absolute(x[, out])```** - Calculate the absolute value element-wise.
* **```rint(x[, out])```** - Round elements of the array to the nearest integer.
* **```sign(x[, out])```** - Returns an element-wise indication of the sign of a number.
* **```conj(x[, out])```** - Return the complex conjugate, element-wise.
* **```exp(x[, out])```** - Calculate the exponential of all elements in the input array.
* **```exp2(x[, out])```** - Calculate 2**p for all p in the input array.
* **```log(x[, out])```** - Natural logarithm, element-wise.
* **```log2(x[, out])```** - Base-2 logarithm of x.
* **```log10(x[, out])```** - Return the base 10 logarithm of the input array, element-wise.
* **```expm1(x[, out])```** - Calculate exp(x) - 1 for all elements in the array.
* **```log1p(x[, out])```** - Return the natural logarithm of one plus the input array, element-wise.
* **```sqrt(x[, out])```** - Return the positive square-root of an array, element-wise.
* **```square(x[, out])```** - Return the element-wise square of the input.
* **```reciprocal(x[, out])```** - Return the reciprocal of the argument, element-wise.
* **```ones_like(a[, dtype, order, subok])```** - Return an array of ones with the same shape and type as a given array.
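
Most functions in this list accept an optional `out` argument. As a minimal sketch with made-up values, the example below writes results into a preallocated array instead of creating a new one on each call.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([10.0, 20.0, 30.0])

result = np.empty_like(x1)             # preallocated output buffer
np.add(x1, x2, out=result)             # result now holds x1 + x2
np.multiply(result, 2.0, out=result)   # doubled in place, no new array

print(result)                          # [22. 44. 66.]
```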

# Matplotlib
**Matplotlib provides a MATLAB-like plotting framework.**

* **Module required**: ```matplotlib```
* **Installation**: ```pip install matplotlib``` - http://matplotlib.org/users/installing.html 

In [None]:
import matplotlib.pyplot as plt

## Plotting

In [None]:
plt.plot([1, 2, 3, 4])
plt.ylabel('some numbers')
plt.show()

**Plot x versus y**

In [None]:
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])

# axis [xmin, xmax, ymin, ymax]
plt.axis([0, 6, 0, 20])
plt.show()

**With numpy arrays**

In [None]:
import numpy as np

# evenly sampled time at 200 ms intervals
t = np.arange(0., 5., 0.2)

# red dashed, blue squares and green triangles
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.show()

## Subplots

In [None]:
def f(t):
    return np.exp(-t) * np.cos(2*np.pi*t)

t1 = np.arange(0.0, 5.0, 0.1)
t2 = np.arange(0.0, 5.0, 0.02)

plt.figure(1)
plt.subplot(211)
plt.plot(t1, f(t1), 'bo', t2, f(t2), 'k')

plt.subplot(212)
plt.plot(t2, np.cos(2*np.pi*t2), 'r--')
plt.show()
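
The same two-panel layout can also be built with `plt.subplots`, which returns the figure and its axes as objects so that each panel is addressed by name rather than by the "current subplot". This is a sketch of an alternative style, not a replacement for the cell above; the function name `damped`, the titles and the labels are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def damped(t):
    # Same damped oscillation as f(t) in the cell above
    return np.exp(-t) * np.cos(2 * np.pi * t)

t = np.arange(0.0, 5.0, 0.02)

# Two rows, one column, with a shared x axis
fig, (top, bottom) = plt.subplots(2, 1, sharex=True)
top.plot(t, damped(t), 'k')
top.set_title('Damped oscillation')
bottom.plot(t, np.cos(2 * np.pi * t), 'r--')
bottom.set_xlabel('time (s)')
```

Calling `plt.show()` afterwards displays the figure, as in the other cells.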

**Multiple figures**

In [None]:
plt.figure(1)                # the first figure
plt.subplot(211)             # the first subplot in the first figure
plt.plot([1, 2, 3])
plt.subplot(212)             # the second subplot in the first figure
plt.plot([4, 5, 6])


plt.figure(2)                # a second figure
plt.plot([4, 5, 6])          # creates a subplot(111) by default

plt.figure(1)                # figure 1 current; subplot(212) still current
plt.subplot(211)             # make subplot(211) in figure1 current
plt.title('Easy as 1, 2, 3') # subplot 211 title

plt.show()

**Working with text**

In [None]:
mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)

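# the histogram of the data, normalized so that the bar areas sum to 1
# ('density' replaces the 'normed' argument removed in recent Matplotlib versions)
n, bins, patches = plt.hist(x, 50, density=True, facecolor='g', alpha=0.75)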


plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title('Histogram of IQ')
plt.text(60, .025, r'$\mu=100,\ \sigma=15$') # TeX equation expressions
plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.show()

**Annotating text**

In [None]:
ax = plt.subplot(111)

t = np.arange(0.0, 5.0, 0.01)
s = np.cos(2*np.pi*t)
line, = plt.plot(t, s, lw=2)

plt.annotate('local max', xy=(2, 1), xytext=(3, 1.5),
            arrowprops=dict(facecolor='black', shrink=0.05),
            )

plt.ylim(-2,2)
plt.show()

# Pandas
**Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.**

* **Module required**: ```pandas```
* **Installation**: ```pip install pandas``` - http://pandas.pydata.org/pandas-docs/stable/install.html

In [None]:
import pandas as pd

## Series
**One-dimensional ndarray with axis labels (including time series).**
[Documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html#pandas.Series)

In [None]:
s = pd.Series([1,3,5,np.nan,6,8])
s
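
The "axis labels" are what distinguish a Series from a plain NumPy array. A small sketch with made-up values and labels, showing access by label and how missing values are treated:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, np.nan, 8], index=['a', 'b', 'c', 'd'])

print(s['b'])          # access by label: 3.0
print(s.isna().sum())  # number of missing values: 1
print(s.mean())        # NaN is skipped by default: 4.0
```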

## Dataframes
**Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.**
[Documentation](http://pandas.pydata.org/pandas-docs/stable/api.html#dataframe)
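
The statement that arithmetic aligns on both row and column labels can be illustrated with two small frames whose labels only partly overlap. The frames `left` and `right` are made up for this sketch; only the cell whose row label and column label exist in both frames gets a value, everything else becomes NaN.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])
right = pd.DataFrame({'A': [10, 20], 'C': [30, 40]}, index=['y', 'z'])

total = left + right        # aligned on the union of labels
print(total)
print(total.loc['y', 'A'])  # 2 + 10 = 12.0
```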

### Create Data

In [None]:
################
### Datasets ###
################

# A list of tuples
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
data_zipped = list(zip(names, births))  # list() is needed on Python 3, where zip returns an iterator

# A numpy array
numpy_array = np.random.randn(6,4)
dates = pd.date_range('20130101', periods=6)

# A dict with column names as keys
data_dict = {
    "Names": ["Bob", "Jessica", "Mary", "John", "Mel"],
    "Births": [968, 155, 77, 578, 973]
}

# A dict mapping each name to its number of births
data_dict2 = {
    "Bob": 968,
    "Jessica": 155,
    "Mary": 77,
    "John": 578,
    "Mel": 973
}

**Create a DataFrame from a list of tuples**

In [None]:
df = pd.DataFrame(data = data_zipped, columns=['Names', 'Births'])
df

**Create a DataFrame from a dictionary**

In [None]:
df = pd.DataFrame(data_dict)
print(df)

In [None]:
df2 = pd.DataFrame(list(data_dict2.items()), columns=["Names", "Births"])
print(df2)

**Create a DataFrame from a Numpy Array**

In [None]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(numpy_array, index=dates, columns=list('ABCD'))
print(df)

**Output data to CSV**

In [None]:
df.to_csv('files/births1889_output.csv', index=False)

**Output data to JSON**

In [None]:
df.to_json('files/births1889_output.json')

### Get Data

**Read data from CSV**

In [None]:
df = pd.read_csv("files/births1889_output.csv")
print(df)

**Read data from JSON**

In [None]:
df = pd.read_json('files/simple.json')
print(df)
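
Writing and reading back can be tried without touching the `files/` directory by using an in-memory text buffer. This is a sketch under that assumption; the two-row frame is a shortened version of the births data.

```python
import io
import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Jessica'], 'Births': [968, 155]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False: do not write the row labels
buffer.seek(0)                   # rewind before reading

restored = pd.read_csv(buffer)   # the header row becomes the column names
print(restored)
```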

### Select Data

In [None]:
# Dataset
experimentDF = pd.read_csv("files/parasite_data.csv", na_values=[" "], skiprows=1,names=['Virulence', 'Replicate', 'ShannonDiversity'])
experimentDF

**Select rows**

In [None]:
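# select the row at position 12 (the 13th row, since positions start at 0)
experimentDF[12:13]

In [None]:
# select the rows at positions 3 to 14 (the end of the slice is excluded)
experimentDF[3:15]

In [None]:
# select the rows labeled 3 to 15 (with 'loc' both ends are included)
# in columns "Virulence" and "ShannonDiversity"
experimentDF.loc[3:15, ["Virulence", "ShannonDiversity"]]

In [None]:
# select the row labeled 12 in the "Virulence" column
experimentDF.loc[12, ["Virulence"]]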
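
Label-based and position-based selection only look alike when the index happens to be 0, 1, 2, and so on. The sketch below uses a made-up frame whose index starts at 100 to show the difference between `loc` (labels, end included) and `iloc` (positions, end excluded).

```python
import pandas as pd

frame = pd.DataFrame({'value': [10, 20, 30, 40]}, index=[100, 101, 102, 103])

by_label = frame.loc[101:102]    # label-based: both ends included
by_position = frame.iloc[1:2]    # position-based: end excluded

print(by_label['value'].tolist())     # [20, 30]
print(by_position['value'].tolist())  # [20]
```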

**Select columns**

In [None]:
# select column "Virulence"
print(experimentDF["Virulence"]) # by label *FASTER*
print(experimentDF.Virulence)    # by attribute

In [None]:
# select columns "Virulence" and "ShannonDiversity"
print(experimentDF.loc[:, ["Virulence", "ShannonDiversity"]]) # using 'loc' *FASTER*
print(experimentDF[["Virulence", "ShannonDiversity"]])        # using [[col1, col2, ...]]

In [None]:
# TIMEIT: label access is faster than attribute access!
from timeit import timeit
print("Label access VS Attribute access:")
print(timeit('experimentDF["Virulence"]', setup='from __main__ import experimentDF'))
print(timeit('experimentDF.Virulence', setup='from __main__ import experimentDF'))

**Select scalar element**

In [None]:
# select element at column "ShannonDiversity" and row 0
experimentDF.at[0, "ShannonDiversity"] # using 'at' *FASTER*
experimentDF["ShannonDiversity"][0]    # using [][]

In [None]:
# TIMEIT: 'at' is faster than chained [][] indexing!
import timeit
print("'at' versus [][]:")
print(timeit.timeit('experimentDF.at[0, "ShannonDiversity"]', setup="from __main__ import experimentDF"))
print(timeit.timeit('experimentDF["ShannonDiversity"][0]', setup="from __main__ import experimentDF"))

**Select by boolean expression(s)**

In [None]:
# show all entries for which the Shannon diversity > 2.0
experimentDF[experimentDF["ShannonDiversity"] > 2.0]

In [None]:
# multiple boolean expressions
# show all entries for which the virulence == 0.6 and Shannon diversity > 1.5
experimentDF[(experimentDF["Virulence"] == 0.6) & (experimentDF["ShannonDiversity"] > 1.5)]

**Handling NA/NaN values**

In [None]:
experimentDF[300:]

**```dropna()```** - *drop rows that have an NA/NaN value*

In [None]:
# Drop NA
experimentDF["Virulence"].dropna()

**```fillna(r)```** - *replace NA values by r*

In [None]:
# Fill NA
experimentDF.fillna(0.0)["Virulence"]
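
Both methods return a new object and leave the original untouched. The sketch below uses a small made-up frame with the same column names to count missing values, drop rows selectively with `subset`, and fill gaps with the column mean instead of a constant.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'Virulence': [0.5, np.nan, 0.8],
                      'ShannonDiversity': [1.2, 2.4, np.nan]})

print(frame.isna().sum())                      # missing values per column
complete = frame.dropna()                      # keep rows without any NaN
partial = frame.dropna(subset=['Virulence'])   # only require Virulence
filled = frame.fillna(frame.mean())            # replace NaN by the column mean

print(len(complete), len(partial))             # 1 2
print(frame.isna().sum().sum())                # original still has 2 NaN
```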

**Functions**

**```index```** - *list DataFrame indexes*

In [None]:
experimentDF.index

**```columns```** - *list DataFrame column names*

In [None]:
experimentDF.columns

**```values```** - *return the DataFrame content as a NumPy array*

In [None]:
experimentDF.values

**```T```** - *transpose a DataFrame*

In [None]:
experimentDF.T

### Sort Data

**Sort by values**

In [None]:
experimentDF_sorted = experimentDF.sort_values(by="ShannonDiversity", ascending=False)
experimentDF_sorted.head(5)

**Sort by index**

In [None]:
# axis=1 sorts the column labels; use axis=0 (the default) to sort by row index
experimentDF_sorted = experimentDF.sort_index(axis=1, ascending=False)
experimentDF_sorted.head(5)

### Plot Data
**Pandas also uses ```matplotlib.pyplot``` to plot its DataFrames.**

In [None]:
# Import the library
import matplotlib.pyplot as plt

# Create graph
# A Series plot puts the row index on the x axis and the values on the y axis
plot = experimentDF['ShannonDiversity'].plot(title="ShannonDiversity")
plot.set_xlabel("Row index")
plot.set_ylabel("ShannonDiversity")

# Maximum value in the data set
MaxShannonDiversity = experimentDF['ShannonDiversity'].max()

# Virulence associated with the maximum value
MaxVirulence = experimentDF['Virulence'][experimentDF['ShannonDiversity'] == MaxShannonDiversity].values

# Text to display on graph
Text = str(MaxShannonDiversity) + " - " + str(MaxVirulence)

# Add text to graph
plt.annotate(Text, xy=(1, MaxShannonDiversity), xytext=(8, 0), 
                 xycoords=('axes fraction', 'data'), textcoords='offset points')

plt.show()

experimentDF[experimentDF['ShannonDiversity'] == MaxShannonDiversity]

## Statistical analysis with pandas

**```max()```**

In [None]:
experimentDF['ShannonDiversity'].max()

**```min()```**

In [None]:
experimentDF['ShannonDiversity'].min()

**```mean()```**

In [None]:
experimentDF["ShannonDiversity"].mean()

**```var()```**

In [None]:
experimentDF["ShannonDiversity"].var()

**```std()```**

In [None]:
experimentDF["ShannonDiversity"].std()

**```describe()```**

In [None]:
experimentDF["ShannonDiversity"].describe()
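
One detail worth knowing when comparing these results with NumPy: pandas `var()` and `std()` compute the sample statistic (`ddof=1`) by default, while `np.var` and `np.std` compute the population statistic (`ddof=0`). A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 6.0])

sample_var = values.var()                    # pandas default: ddof=1
population_var = np.var(values.to_numpy())   # NumPy default: ddof=0

print(sample_var)               # 8 / 3
print(population_var)           # 8 / 4 = 2.0
print(values.var(ddof=0))       # matches the NumPy default: 2.0
```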

**Going further with ```stats``` from ```scipy``` module**

In [None]:
from scipy import stats

**Standard Error on the Mean (SEM)**

In [None]:
print(stats.sem(experimentDF["ShannonDiversity"]))
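
The SEM is the sample standard deviation divided by the square root of the number of observations. The sketch below computes it by hand on made-up values and compares it with the pandas `sem()` method, which skips NaN by default; `scipy.stats.sem` instead propagates NaN unless a different `nan_policy` is passed.

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 6.0, np.nan])

clean = values.dropna()
sem_manual = clean.std() / np.sqrt(len(clean))   # sample std (ddof=1) / sqrt(n)
sem_pandas = values.sem()                        # NaN is skipped by default

print(sem_manual)
print(sem_pandas)
```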

**Mann-Whitney-Wilcoxon (MWW) RankSum test**

In [None]:
treatment1 = experimentDF[experimentDF["Virulence"] == 0.5]["ShannonDiversity"]  
treatment2 = experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"]  

z_stat, p_val = stats.ranksums(treatment1, treatment2)  

print(z_stat)
print(p_val)   # if < 0.05, treatment1 and treatment2 differ significantly

**One-way analysis of variance (ANOVA)**

In [None]:
treatment1 = experimentDF[experimentDF["Virulence"] == 0.7]["ShannonDiversity"]  
treatment2 = experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"]  
treatment3 = experimentDF[experimentDF["Virulence"] == 0.9]["ShannonDiversity"]  

f_val, p_val = stats.f_oneway(treatment1, treatment2, treatment3)  

print(f_val)
print(p_val)  # if > 0.05, no significant difference between the treatments is detected

**Bootstrapped 95% confidence intervals**

* **Module required**: ```scikits.bootstrap```
* **Installation**: ```pip install scikits.bootstrap```

In [None]:
from numpy import mean  # scipy.mean was only an alias of numpy.mean and is deprecated
import scikits.bootstrap as bootstrap

treatment1 = experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"][:10]  

# compute 95% confidence intervals around the mean  
CIs = bootstrap.ci(data=treatment1, statfunction=mean)  

# compute 80% confidence intervals around the mean
CIs2 = bootstrap.ci(data=treatment1, statfunction=mean, alpha=0.2)

print(CIs)
print(CIs2)
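
To show what a bootstrap confidence interval does, here is a minimal NumPy-only sketch of the plain percentile bootstrap. It is not the algorithm used by `scikits.bootstrap`, whose results may differ; the helper name `percentile_bootstrap_ci`, its parameters and the sample values are all hypothetical.

```python
import numpy as np

def percentile_bootstrap_ci(data, statfunction=np.mean, alpha=0.05,
                            n_samples=10000, seed=0):
    """Percentile bootstrap confidence interval (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Resample with replacement: one row of indices per bootstrap replicate
    indices = rng.integers(0, len(data), size=(n_samples, len(data)))
    replicates = np.array([statfunction(data[row]) for row in indices])
    # The interval is read off the percentiles of the replicate statistics
    lower = np.percentile(replicates, 100 * alpha / 2)
    upper = np.percentile(replicates, 100 * (1 - alpha / 2))
    return lower, upper

# Made-up measurements, standing in for the first 10 values of a treatment
sample = np.array([1.9, 2.1, 2.4, 2.0, 2.6, 2.2, 1.8, 2.3, 2.5, 2.0])

low95, high95 = percentile_bootstrap_ci(sample)             # 95% interval
low80, high80 = percentile_bootstrap_ci(sample, alpha=0.2)  # 80% interval
print(low95, high95)
print(low80, high80)
```

With the same seed, the 80% interval lies inside the 95% interval, since both are read from the same set of replicates.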

# Exercise: Parse a switch record with Pandas and output to MySQL / DynamoDB
### Problem
**Objectives:** 
- Switch record file (CSV) is located at ```https://172.20.104.147/moi/python_classes/swith_records.csv```
- Read the CSV file and devise a strategy to extract the data.
- Parse the information in ```switch_records.csv``` and construct Pandas dataframes from it.
- Write each table to its own JSON file.
- Create and populate corresponding tables in MySQL.
- Create and populate corresponding tables in DynamoDB.

**Information:**
- ```switch_records.csv``` contains multiple tables:
    - Inventory
    - HP BladeSystem Rack  
    - Network Interface
- Each table has its name on the first line, the column names on the second line and then the data.
    
**Steps**
1. Write a function ```split_csv(csv_path, table_names)``` to split the file into temporary csv files (one for each table).
2. Load each file into a Pandas ```DataFrame``` using the ```pandas.read_csv()``` function.
3. Generate JSON files from the Pandas dataframes you created.
4. Create your database with a unique id (1 letter of your first name, 7 letters of your last name).
5. Connect to MySQL and construct tables with the data.
6. Connect to DynamoDB and construct tables with the data.

**Help**
1. If you're stuck at step 1:
    * You can use the ```csv``` module to import the file and loop through each row.
    * You can use ```dirname``` and ```join``` from the ```os.path``` module to manipulate path strings easily.
    * If you're still stuck, you can use the ```split_csv(csv_path, table_names)``` function that I wrote to split the tables into multiple csv files. The function returns the paths of the files written, so that you can iterate over them.
    * If you need it, just run the following IPython Notebook cell and you will have access to the ```split_csv``` function.
2. If you need to skip rows, you can pass ```skiprows=[i1, i2, ...]``` to pandas' ```read_csv``` function.
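
As a starting point for step 1, here is one possible sketch of the splitting strategy built on the ```csv``` and ```os.path``` hints above. It is not the function in ```solutions/help_csv.py```. It assumes that a table starts on a row whose first cell is exactly one of the table names, that blank rows can be dropped, and that the output files may be written next to the source file.

```python
import csv
from os.path import dirname, join

def split_csv(csv_path, table_names):
    """Split a multi-table CSV into one CSV file per table; return the paths."""
    out_dir = dirname(csv_path)   # assumption: write next to the source file
    paths = []
    writer = None
    out_file = None
    with open(csv_path, newline='') as source:
        for row in csv.reader(source):
            if row and row[0].strip() in table_names:
                # A table-name row closes the previous file and starts a new one
                if out_file is not None:
                    out_file.close()
                path = join(out_dir, row[0].strip() + '.csv')
                out_file = open(path, 'w', newline='')
                writer = csv.writer(out_file)
                paths.append(path)
            elif writer is not None and any(cell.strip() for cell in row):
                # Header and data rows go to the current table's file
                writer.writerow(row)
    if out_file is not None:
        out_file.close()
    return paths
```

Each returned path can then be passed to ```pandas.read_csv()```, since the first row of every output file holds the column names.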

In [None]:
%load solutions/help_csv.py

### Solutions

In [None]:
%load solutions/switch_solution.py

### Output files

In [None]:
%load temp/Inventory.csv

In [None]:
%load temp/HP\ BladeSystem\ Rack.csv

In [None]:
%load temp/Network\ Interface.csv