# Linear Algebra

#### Importing Packages
Python has an extensive collection of third-party libraries, or ***packages***, with additional functions, data-structures, etc.  Many (most?) packages of interest are hosted on the Python Package Index ***pypi***, and can be installed into your environment using ***pip***.  The Anaconda distribution in the pre-work includes a number of these that are useful in data science, so you should have most of them installed already.  

ref:  https://pypi.python.org/pypi

A note on namespaces when importing - there are a few different ways to import:

    1. import numpy

    - all submodules of numpy are accessible in the numpy namespace
    - e.g. numpy.array([1,2,3])


    2. import numpy as np

    - same as 1 except an alias 'np' is created for the namespace instead
    - e.g. np.array([1,2,3])


    3. from numpy import *

    - adds all submodules to global namespace
    - e.g. array([1,2,3])
    - Note: This can be dangerous: if different modules define the same name, whatever is imported last overwrites what came before it (a naming collision).
    
    4. from numpy import array
    
    - will import only the indicated submodules into the global namespace
    - e.g. array([1,2,3])
    - Note: can be ok since you are being explicit
    
    We will generally use 2 and 4 (sparingly)
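
The collision described under option 3 can be reproduced with two explicit imports of the same name. This is a minimal sketch; `math.sqrt` and `numpy.sqrt` were picked here purely as an illustrative colliding pair.

```python
import numpy as np

from math import sqrt
first = sqrt             # math.sqrt, which accepts a single number

from numpy import sqrt   # same name imported later: silently replaces math's sqrt
second = sqrt

shadowed = first is not second                  # True: the name now points elsewhere
is_numpy_version = isinstance(sqrt, np.ufunc)   # True: sqrt is now numpy's function
roots = sqrt([1, 4, 9])                         # fine for numpy; math.sqrt would raise TypeError
```

With `import numpy as np` both functions stay reachable as `math.sqrt` and `np.sqrt`, which is why the aliased form is preferred.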

### NumPy

As we've seen in lecture, linear algebra is the branch of mathematics that studies vectors, vector spaces, and the linear maps between them. This core concept matters because many data science techniques rely on it (e.g. Linear Regression, Support Vector Machines, Recommender Systems, Neural Networks).

NumPy is a package designed to be used in scientific computing, and specifically around building and manipulating N-dimensional array objects.

In [1]:
# import should always be at the top of the page
import numpy

import numpy as np

from numpy import absolute

# The next one is dangerous to do, and not recommended 
# except in cases where you know why you're using it
# from numpy import *

Now we can do the same thing three ways

In [2]:
print numpy.absolute(-10)
print np.absolute(-10)
print absolute(-10)

10
10
10


We can create the Linear Algebra Objects we saw in lecture

In [3]:
# we have a vector here
vector = np.array([1, 2, 1])

In [4]:
# we have a matrix here
data = np.array([[1, 2, 3],[2, 4, 9]])
data

array([[1, 2, 3],
       [2, 4, 9]])

In [5]:
data[0]  # first row

array([1, 2, 3])

In [15]:
# give me everything till row 2 (not including 2), and column 1
# note the index starts at 0
data[ : 2, 1]  # all rows, second column

array([2, 4])

Numpy lets us perform matrix operations

In [16]:
np.dot(data, vector)

array([ 8, 19])
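
The product above is one dot product per row of `data`. The sketch below spells out that arithmetic and checks it against `np.dot`; the `@` operator is an addition here and assumes Python 3.5+ with NumPy 1.10+.

```python
import numpy as np

data = np.array([[1, 2, 3], [2, 4, 9]])   # shape (2, 3)
vector = np.array([1, 2, 1])              # shape (3,)

# each output entry is the sum of elementwise products of one row with the vector
by_hand = np.array([
    1 * 1 + 2 * 2 + 3 * 1,   # row 0 -> 8
    2 * 1 + 4 * 2 + 9 * 1,   # row 1 -> 19
])

same_as_dot = np.array_equal(by_hand, np.dot(data, vector))
same_as_operator = np.array_equal(by_hand, data @ vector)
```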

We can transpose an array

In [17]:
# Transposing a matrix turns its rows into columns and vice versa
print data
print data.T

[[1 2 3]
 [2 4 9]]
[[1 2]
 [2 4]
 [3 9]]


#### Creating a square matrix array

In [18]:
# arange(n) creates a 1-D array of the integers 0 through n-1 (here 0 to 24)
# reshape(M, N) is a method that rearranges those values into an M x N array
a = np.arange(25).reshape(5,5)
a

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

In [19]:
biga = a*10
biga

array([[  0,  10,  20,  30,  40],
       [ 50,  60,  70,  80,  90],
       [100, 110, 120, 130, 140],
       [150, 160, 170, 180, 190],
       [200, 210, 220, 230, 240]])

In [20]:
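# axis=None (the default): average over all the data
# axis=0: collapse the rows, giving one average per column
# axis=1: collapse the columns, giving one average per row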
print biga.mean()
print biga.mean(0) #Average per column
# the 0 above is telling it what axis to work off. So it is axis=0
biga.mean(1) #average per row
# type(biga.mean(1))

120.0
[ 100.  110.  120.  130.  140.]


array([  20.,   70.,  120.,  170.,  220.])
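
One way to remember the `axis` argument is that it names the dimension that gets collapsed. Here is a short sketch on a small made-up array (the name `m` and its values are for illustration only):

```python
import numpy as np

m = np.array([[1, 2, 3],
              [5, 6, 7]])          # shape (2, 3)

overall = m.mean()                 # all six values -> 4.0
per_column = m.mean(axis=0)        # collapses the 2 rows -> [3., 4., 5.]
per_row = m.mean(axis=1)           # collapses the 3 columns -> [2., 6.]
```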

Creating a matrix with numpy

In [21]:
bigm = np.matrix(biga-20)
bigm

matrix([[-20, -10,   0,  10,  20],
        [ 30,  40,  50,  60,  70],
        [ 80,  90, 100, 110, 120],
        [130, 140, 150, 160, 170],
        [180, 190, 200, 210, 220]])

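Creating the Inverse of a Matrix. Note that `biga-20` is singular (each row is the previous row plus 50), so the values returned below are floating-point noise rather than a true inverse.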

In [22]:
np.linalg.inv(biga-20)

array([[ -2.81474977e+13,  -1.52777778e-03,   5.62949953e+13,
         -2.22222222e-02,  -2.81474977e+13],
       [  3.51843721e+13,   2.25000000e-02,  -5.27765581e+13,
         -3.51843721e+13,   5.27765581e+13],
       [ -4.22212465e+13,   9.38249922e+13,  -7.97512434e+13,
          4.69124961e+13,  -1.87649984e+13],
       [  9.14793674e+13,  -1.87649984e+14,   9.26521798e+13,
          1.17281240e+13,  -8.20968682e+12],
       [ -5.62949953e+13,   9.38249922e+13,  -1.64193736e+13,
         -2.34562481e+13,   2.34562481e+12]])
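
Every row of `biga-20` equals the previous row plus 50, so the matrix has rank 2 and no true inverse exists. The sketch below checks the rank first and then verifies an inverse on a small invertible matrix; the 2x2 matrix `m` is made up for illustration.

```python
import numpy as np

singular = np.arange(25).reshape(5, 5) * 10 - 20   # same values as biga-20
rank = np.linalg.matrix_rank(singular)             # 2, well short of the 5 needed

m = np.array([[2.0, 1.0],
              [1.0, 3.0]])                         # determinant 5, so invertible
m_inv = np.linalg.inv(m)

# a matrix times its inverse should give the identity (up to rounding)
round_trip_ok = np.allclose(np.dot(m, m_inv), np.eye(2))
```

Checking `round_trip_ok` (or the rank) is a cheap guard against trusting output like the very large values shown above.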

#### Slices

In [23]:
# Converting matrix to an array
bigm = np.array(bigm)
print bigm
bigm[0]

[[-20 -10   0  10  20]
 [ 30  40  50  60  70]
 [ 80  90 100 110 120]
 [130 140 150 160 170]
 [180 190 200 210 220]]


array([-20, -10,   0,  10,  20])

In [25]:
#Same thing, but demonstrating the full slice with a colon
print biga
biga[0,:]

[[  0  10  20  30  40]
 [ 50  60  70  80  90]
 [100 110 120 130 140]
 [150 160 170 180 190]
 [200 210 220 230 240]]


array([ 0, 10, 20, 30, 40])

In [26]:
print biga
biga[: , 2]

[[  0  10  20  30  40]
 [ 50  60  70  80  90]
 [100 110 120 130 140]
 [150 160 170 180 190]
 [200 210 220 230 240]]


array([ 20,  70, 120, 170, 220])

#### Describing your Arrays

In [27]:
compa = np.arange(30).reshape(5,3,2)
compa

array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]],

       [[12, 13],
        [14, 15],
        [16, 17]],

       [[18, 19],
        [20, 21],
        [22, 23]],

       [[24, 25],
        [26, 27],
        [28, 29]]])

In [28]:
# let's describe it

# shape of the matrix
print compa.shape
# number of dimensions
print compa.ndim
# get the data type
print compa.dtype

(5, 3, 2)
3
int64


In [29]:
# block 3 (the fourth 3x2 block), every row of it, column 1
compa[3,:,1]

array([19, 21, 23])
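
The index `[3, :, 1]` reads left to right, one entry per axis. This sketch takes it one step at a time (the intermediate names `block` and `column` are illustrative):

```python
import numpy as np

compa = np.arange(30).reshape(5, 3, 2)

block = compa[3]          # fourth 3x2 block: [[18, 19], [20, 21], [22, 23]]
column = block[:, 1]      # every row of that block, second column

same = np.array_equal(column, compa[3, :, 1])
```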

In [32]:
# We can assign values using list-like indexing
# But be careful with types:
# compa holds integers, so 5.9 is truncated to 5 on assignment
compa[0,0,0] = 5.9
compa[0,0,0]

5

We can change the datatype when needed

In [33]:
# reassigning the array to be a type float
compa = compa.astype(float)
compa[0,0,0] = 5.75
compa[0,0,0]

5.75

#### Stacking arrays

Arrays being stacked must have the same size along every dimension except the one they are stacked along.

In [34]:
# both arrays have the same length, so they can be stacked either way
a = np.array((1,2,3))
b = np.array((2,3,4))
print 'H Stack'
# hstack stacks it horizontally
print np.hstack((a,b))
print 'V Stack'
# vstack stacks it vertically
print np.vstack((a,b))

H Stack
[1 2 3 2 3 4]
V Stack
[[1 2 3]
 [2 3 4]]


In [36]:
# the nested brackets make each array 2-D with shape (3, 1), i.e. a column vector
a = np.array([[1],[2],[3]])
b = np.array([[2],[3],[4]])
print 'H Stack'
print np.hstack((a,b))
print 'V Stack'
print np.vstack((a,b))

H Stack
[[1 2]
 [2 3]
 [3 4]]
V Stack
[[1]
 [2]
 [3]
 [2]
 [3]
 [4]]
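
For 2-D inputs like these, `hstack` and `vstack` behave like `np.concatenate` along axis 1 and axis 0 respectively, which is also where the size rule comes from. The sketch below shows this; the mismatched array `short` is made up to trigger the error.

```python
import numpy as np

a = np.array([[1], [2], [3]])      # shape (3, 1)
b = np.array([[2], [3], [4]])      # shape (3, 1)

side_by_side = np.concatenate((a, b), axis=1)   # same result as np.hstack((a, b))
on_top = np.concatenate((a, b), axis=0)         # same result as np.vstack((a, b))

# a (2, 1) array cannot sit beside a (3, 1) array: the row counts differ
short = np.array([[9], [9]])
try:
    np.hstack((a, short))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```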


### Using Random Numbers

Random numbers are very helpful and are necessary at times for testing data pipelines and running statistical analyses. Functions for creating random values are under numpy.random.

In [37]:
#Create a randomized array
# between the numbers 0 and 1 
rm = np.random.rand(5,5)
rm

array([[ 0.73066195,  0.79614213,  0.99213169,  0.48559635,  0.88139695],
       [ 0.63938298,  0.74557455,  0.86621181,  0.02188044,  0.59477655],
       [ 0.37800489,  0.61997048,  0.28811948,  0.03968556,  0.16886658],
       [ 0.70663519,  0.02357847,  0.39673412,  0.98563536,  0.39808045],
       [ 0.37018351,  0.67772518,  0.7781151 ,  0.36079116,  0.58830422]])

In [38]:
rm.shape

(5, 5)

You can shuffle values randomly as well

In [41]:
# shuffle works in place and only along the first axis of a multi-dimensional array:
# whole rows change position, while the values inside each row stay together
# to reorder columns instead, index with a permutation: rm[:, np.random.permutation(rm.shape[1])]
np.random.shuffle(rm)
rm

array([[ 0.63938298,  0.74557455,  0.86621181,  0.02188044,  0.59477655],
       [ 0.70663519,  0.02357847,  0.39673412,  0.98563536,  0.39808045],
       [ 0.37018351,  0.67772518,  0.7781151 ,  0.36079116,  0.58830422],
       [ 0.73066195,  0.79614213,  0.99213169,  0.48559635,  0.88139695],
       [ 0.37800489,  0.61997048,  0.28811948,  0.03968556,  0.16886658]])
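
Comparing the output above with the original `rm` shows that the rows moved as whole units. The sketch below makes that easier to see on a small integer grid and also reorders columns; the seed, the name `grid` and the use of `RandomState` are choices made for this illustration.

```python
import numpy as np

rng = np.random.RandomState(0)          # seeded so the run is repeatable
grid = np.arange(12).reshape(4, 3)

rows_shuffled = grid.copy()
rng.shuffle(rows_shuffled)              # in place; each row stays intact

# reorder columns instead by indexing with a random permutation of column numbers
cols_shuffled = grid[:, rng.permutation(grid.shape[1])]

# the same rows (and the same columns) are still present, only their order changed
rows_preserved = sorted(rows_shuffled.tolist()) == sorted(grid.tolist())
cols_preserved = sorted(cols_shuffled.T.tolist()) == sorted(grid.T.tolist())
```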

In [42]:
print rm.mean()
print rm.mean(0) #Average per column
print rm.mean(1) #average per row

0.541367407459
[ 0.56497371  0.57259816  0.66426244  0.37871778  0.52628495]
[ 0.57356527  0.50213272  0.55502383  0.77718582  0.2989294 ]


In [43]:
# for a normal distribution with a chosen mean and spread, use np.random.normal
# here 5 is the mean, 9 is the standard deviation (so the variance is 81), (30,30) is the matrix size
rm = np.random.normal(5,9,(30,30))
rm

array([[  2.98133747e+01,  -2.46147476e+00,  -1.66344454e+00,
          4.22438164e+00,   1.53280304e+01,   1.27047572e+01,
          9.76589725e+00,   6.31116361e+00,   1.28332908e+00,
         -3.12682980e-01,   1.94463950e+01,   2.02127199e+01,
          3.15744692e+00,   1.86457763e+01,   1.40165169e+01,
          1.36541791e+01,   1.19418544e+01,   1.96635661e+01,
          5.31766902e+00,   9.33962848e+00,   5.76535507e+00,
          7.33554872e+00,   1.67386156e+01,   1.37189193e+01,
          8.76625973e+00,   1.99113082e+01,  -3.39847407e+00,
          1.38612164e+01,   1.10516454e+01,   1.40753057e+01],
       [  2.05295125e+01,   4.95516487e+00,   2.39489789e+01,
          3.68754782e+00,  -3.75437895e+00,   2.66187404e+01,
          7.05428977e-01,  -1.91114959e+00,   1.34292529e+01,
         -1.45923381e+01,  -8.56096406e+00,   1.02161333e+01,
          3.93456088e+00,  -7.17593404e-01,   2.17176305e+01,
         -1.66018710e-01,   1.97816588e+00,  -7.78084204e+00,
       

In [44]:
print rm.mean(), "which is hopefully close to the input mean"
print rm.var(), "which variance = stdev squared"
print np.median(rm)

5.00663509521 which is hopefully close to the input mean
82.5201880255 which variance = stdev squared
4.96311342816
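
The variance printed above is near 81 rather than 9 because the second argument of `np.random.normal` is the standard deviation. The sketch below repeats the check on a larger seeded sample; the seed and the 300x300 size are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.RandomState(42)
sample = rng.normal(5, 9, (300, 300))   # loc=5, scale=9 (a standard deviation)

sample_mean = sample.mean()             # close to 5
sample_std = sample.std()               # close to 9
sample_var = sample.var()               # close to 9**2 = 81
```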


Find more distributions and random functions here: http://docs.scipy.org/doc/numpy/reference/routines.random.html

### Exercise 1

1) Create a 4x5 array of integers between 0 and 19

In [45]:
ex = np.arange(20).reshape(4,5)
print ex

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]


2) Create a 50x500 array with a mean of 20 and variance of 100. Save it to a variable called  biggie

In [63]:
# np.random.normal takes the standard deviation, i.e. the square root of the variance: sqrt(100) = 10
biggie = np.random.normal(20,10, (50, 500) )
print biggie

[[ 17.72741957   3.66938455  12.08104638 ...,  13.9382309   25.38571311
   22.34839827]
 [ 34.30601698  13.95795596  16.64997851 ...,  10.38382686  13.66924252
   15.37458824]
 [ 30.93492686  17.05965771  24.32626644 ...,  40.38429984  16.2479701
    2.67857599]
 ..., 
 [  7.37576115   7.85854653  27.74031134 ...,  11.03406939  30.96892418
    9.32631188]
 [ 13.13853058   6.3177513   26.70837055 ...,  15.01226236  22.31416742
   11.45072277]
 [ 36.52411518  15.89729485   9.05742654 ...,  18.09887587  19.39901378
   16.91916581]]


In [64]:
print biggie.mean(), "which is hopefully close to the input mean"
print biggie.var(), "which variance = stdev squared"
print np.median(biggie)

19.9938885176 which is hopefully close to the input mean
99.4398307253 which variance = stdev squared
19.9615794923


3) Change the mean of the array to a value within 1 of 0 and the variance within 1 of 25. Think about what the mean and the variance represent and try using various mathematical operations.

In [65]:
big = np.matrix((biggie-20)/2)
print big

[[ -1.13629022  -8.16530773  -3.95947681 ...,  -3.03088455   2.69285655
    1.17419913]
 [  7.15300849  -3.02102202  -1.67501074 ...,  -4.80808657  -3.16537874
   -2.31270588]
 [  5.46746343  -1.47017114   2.16313322 ...,  10.19214992  -1.87601495
   -8.66071201]
 ..., 
 [ -6.31211942  -6.07072673   3.87015567 ...,  -4.48296531   5.48446209
   -5.33684406]
 [ -3.43073471  -6.84112435   3.35418527 ...,  -2.49386882   1.15708371
   -4.27463862]
 [  8.26205759  -2.05135258  -5.47128673 ...,  -0.95056206  -0.30049311
   -1.54041709]]


In [57]:
print big.mean(), "which is hopefully close to the input mean"
print big.var(), "which variance = stdev squared"

-0.0297352172615 which is hopefully close to the input mean
25.3422345997 which variance = stdev squared


## Pandas (python package)

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.

Pandas is great for tabular/indexed data

In [66]:
# NOTE: you should normally put all your imports at the top of the file
import pandas as pd

In [67]:
data = pd.read_csv('../data/nytimes.csv')

In [68]:
# Note here we're calling the head method on the dataframe to return the 'head' of the
# dataframe, in this case the first 4 rows
# head() returns a new DataFrame object and leaves data itself unchanged, this is important later in the course!
data.head(4)

Unnamed: 0,Age,Gender,Impressions,Clicks,Signed_In
0,36,0,3,0,1
1,73,1,3,0,1
2,30,0,3,0,1
3,49,1,3,0,1


In [69]:
data[0:4]

Unnamed: 0,Age,Gender,Impressions,Clicks,Signed_In
0,36,0,3,0,1
1,73,1,3,0,1
2,30,0,3,0,1
3,49,1,3,0,1


In [70]:
# Each DataFrame has an index
# Sometimes you will need to reindex
data.index

Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9, 
            ...
            458431, 458432, 458433, 458434, 458435, 458436, 458437, 458438,
            458439, 458440],
           dtype='int64', length=458441)

In [71]:
# This is a Series
# A DataFrame is made up of several Series with the same index
data.Age

0         36
1         73
2         30
3         49
4         47
5         47
6          0
7         46
8         16
9         52
10         0
11        21
12         0
13        57
14        31
15         0
16        40
17        31
18        38
19         0
20        59
21        61
22        48
23        29
24         0
25        19
26        19
27        48
28        48
29        21
          ..
458411    55
458412    68
458413     0
458414    21
458415    35
458416    26
458417    41
458418    58
458419    46
458420    45
458421    46
458422    47
458423    22
458424    21
458425    40
458426    49
458427    43
458428    40
458429    49
458430     0
458431    21
458432    30
458433    21
458434    61
458435    51
458436     0
458437     0
458438    72
458439     0
458440     0
Name: Age, dtype: int64

In [72]:
# info() lists each column's dtype and non-null count, plus overall memory usage
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 458441 entries, 0 to 458440
Data columns (total 5 columns):
Age            458441 non-null int64
Gender         458441 non-null int64
Impressions    458441 non-null int64
Clicks         458441 non-null int64
Signed_In      458441 non-null int64
dtypes: int64(5)
memory usage: 21.0 MB


In [73]:
data.describe()

Unnamed: 0,Age,Gender,Impressions,Clicks,Signed_In
count,458441.0,458441.0,458441.0,458441.0,458441.0
mean,29.482551,0.367037,5.007316,0.092594,0.70093
std,23.607034,0.481997,2.239349,0.309973,0.457851
min,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,3.0,0.0,0.0
50%,31.0,0.0,5.0,0.0,1.0
75%,48.0,1.0,6.0,0.0,1.0
max,108.0,1.0,20.0,4.0,1.0


In [74]:
# .values exposes a column's data as a numpy array
type(data.Age.values)

numpy.ndarray

#### Just like in numpy, we can use mean, var, and other functions on the data

In [75]:
print data.Age.mean()
print data.Age.var()
print data.Age.max()
print data.Age.min()

29.4825506445
557.292044027
108
0
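
One difference to keep in mind: `Series.var()` divides by n - 1 (the sample variance), whereas `np.var` divides by n unless told otherwise. The sketch below uses a small made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

pandas_var = s.var()                      # ddof=1 by default -> 5 / 3
numpy_var = np.var(s.values)              # ddof=0 by default -> 1.25
matched = np.var(s.values, ddof=1)        # same convention as pandas
```

The means agree either way; only the variance (and standard deviation) depend on this convention.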


In [78]:
# Function that groups users by age.
def map_age_category(x):
    if x < 18:
        return '1'
    elif x < 25:
        return '2'
    elif x < 32:
        return '3'
    elif x < 45:
        return '4'
    else:
        return '5'

# apply calls the given function on each value in Age and collects the results.
# assigning to data['age_categories'] creates a new column
data['age_categories'] = data['Age'].apply(map_age_category)

In [79]:
data.head()

Unnamed: 0,Age,Gender,Impressions,Clicks,Signed_In,age_categories
0,36,0,3,0,1,4
1,73,1,3,0,1,5
2,30,0,3,0,1,3
3,49,1,3,0,1,5
4,47,1,11,0,1,5
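
The same binning can also be expressed with `pd.cut`, which avoids calling a Python function once per row. The sketch below compares the two approaches on a handful of made-up ages; the bin edges are copied from the thresholds in `map_age_category`.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0, 17, 18, 24, 25, 31, 32, 44, 45, 108])

def map_age_category(x):
    # same thresholds as the function above
    if x < 18:
        return '1'
    elif x < 25:
        return '2'
    elif x < 32:
        return '3'
    elif x < 45:
        return '4'
    return '5'

via_apply = ages.apply(map_age_category)

# right=False makes each bin [low, high), matching the "x < threshold" tests
edges = [-np.inf, 18, 25, 32, 45, np.inf]
via_cut = pd.cut(ages, bins=edges, right=False, labels=['1', '2', '3', '4', '5'])
```

Note that `pd.cut` returns a categorical column rather than plain strings, which may or may not be what later steps expect.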


#### Sorting data

In [82]:
# axis = 1 is columns
# axis = 0 is index/rows
# default is axis = 0
data.sort_index(axis=1, ascending=False).head()

Unnamed: 0,age_categories,Signed_In,Impressions,Gender,Clicks,Age
0,4,1,3,0,0,36
1,5,1,3,1,0,73
2,3,1,3,0,0,30
3,5,1,3,1,0,49
4,5,1,11,1,0,47


In [83]:
# sort_values replaces the older data.sort('Signed_In'), which was removed in pandas 0.20
data.sort_values('Signed_In').head()

Unnamed: 0,Age,Gender,Impressions,Clicks,Signed_In,age_categories
458440,0,0,3,0,0,1
149914,0,0,5,0,0,1
149915,0,0,10,1,0,1
149916,0,0,6,1,0,1
353762,0,0,3,0,0,1


In [84]:
ran_data = [
    ['a', 1]
    , ['b', 2]
    , ['c', 3]
]
df = pd.DataFrame(ran_data, columns=['col_a', 'numeric'])
df

Unnamed: 0,col_a,numeric
0,a,1
1,b,2
2,c,3


#### Indexing functions

Pandas Dataframes support various methods for indexing:

- .iloc
- .loc
- .ix

In [85]:
df

Unnamed: 0,col_a,numeric
0,a,1
1,b,2
2,c,3


In [86]:
# iloc accesses a row by its row number
df.iloc[0]

col_a      a
numeric    1
Name: 0, dtype: object

In [87]:
df.set_index('col_a', inplace = True)
df

Unnamed: 0_level_0,numeric
col_a,Unnamed: 1_level_1
a,1
b,2
c,3


In [88]:
# loc accesses a dataframe row by its index label (or column label)
df.loc['a'] = 5
df.loc['b'] = 3

In [89]:
df

Unnamed: 0_level_0,numeric
col_a,Unnamed: 1_level_1
a,5
b,3
c,3


In [90]:
# This can be used to add new columns
df.loc[:,'C'] = df.loc[:,'numeric']

df

Unnamed: 0_level_0,numeric,C
col_a,Unnamed: 1_level_1,Unnamed: 2_level_1
a,5,5
b,3,3
c,3,3


In [91]:
# is equivalent to:
df['D'] = df.loc[:,'numeric']

df

Unnamed: 0_level_0,numeric,C,D
col_a,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,5,5,5
b,3,3,3
c,3,3,3


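.ix is the generic form of indexers: it tries a label-based lookup first and falls back to position. It was deprecated in pandas 0.20 and removed in 1.0, so prefer .loc or .iloc in new code.

Values can be set by index and index + column

In [92]:
# the index holds strings, so the integer 2 is treated as a position (row 'c')
df.ix[2] = 2
df.ix[2, 'C'] = 3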

df

Unnamed: 0_level_0,numeric,C,D
col_a,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,5,5,5
b,3,3,3
c,2,3,2


### Combining DataFrames
#### Appending
We can append dataframes together

In [93]:
# append stacks them on top of each other
# (DataFrame.append was removed in pandas 2.0; pd.concat([df, df]) does the same job)
df_combine = df.append(df)
df_combine

Unnamed: 0_level_0,numeric,C,D
col_a,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,5,5,5
b,3,3,3
c,2,3,2
a,5,5,5
b,3,3,3
c,2,3,2


In [94]:
# When DataFrames are appended together, we often need to create a new index
# we can reset the index using the following function
df_combine.reset_index()

Unnamed: 0,col_a,numeric,C,D
0,a,5,5,5
1,b,3,3,3
2,c,2,3,2
3,a,5,5,5
4,b,3,3,3
5,c,2,3,2


#### Join lets us join together dataframes using their index

In [95]:
df_2 = pd.DataFrame([1, 2, 3], columns=['col'])
df_2

Unnamed: 0,col
0,1
1,2
2,3


In [96]:
# The default is a left join on the index, so null values are placed
# where values are missing (here df is indexed by 'a', 'b', 'c' and df_2 by 0, 1, 2, so nothing matches)
df.join(df_2)


Unnamed: 0_level_0,numeric,C,D,col
col_a,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
a,5,5,5,
b,3,3,3,
c,2,3,2,


In [97]:
# reset the index
df_1 = df.reset_index()
df_1

Unnamed: 0,col_a,numeric,C,D
0,a,5,5,5
1,b,3,3,3
2,c,2,3,2


In [98]:
# try joining again:
df_1.join(df_2)


Unnamed: 0,col_a,numeric,C,D,col
0,a,5,5,5,1
1,b,3,3,3,2
2,c,2,3,2,3


#### Merge allows us to join on any fields

In [99]:
# Merge has a default of inner join,
# so rows without a match on the other side are omitted
# left_on is the column to match on in the left data frame
# right_on is the column to match on in the right data frame
df.merge(df_2, left_on='numeric', right_on='col')

Unnamed: 0,numeric,C,D,col
0,3,3,3,3
1,2,3,2,2
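
The `how` argument controls which unmatched rows survive a merge. The sketch below rebuilds two small frames with the same values as above; the names `left` and `right` are illustrative.

```python
import pandas as pd

left = pd.DataFrame({'numeric': [5, 3, 2]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'col': [1, 2, 3]})

# inner (the default): only the values 3 and 2 appear on both sides
inner = left.merge(right, left_on='numeric', right_on='col')

# left: keep every row of `left`; the unmatched 5 gets NaN in 'col'
kept_left = left.merge(right, left_on='numeric', right_on='col', how='left')

# outer: keep everything; indicator=True adds a '_merge' column naming the source
everything = left.merge(right, left_on='numeric', right_on='col',
                        how='outer', indicator=True)
```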


### Concat combines a list of DataFrames together

In [100]:
# It can be used like append
pd.concat([df, df])

Unnamed: 0_level_0,numeric,C,D
col_a,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,5,5,5
b,3,3,3
c,2,3,2
a,5,5,5
b,3,3,3
c,2,3,2


In [101]:
# But concat will create a sparse DataFrame (mostly NaN) when columns don't match
# This can create huge dataframes when mismatches occur
pd.concat([df, df_2])

Unnamed: 0_level_0,C,D,col,numeric
col_a,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
a,5.0,5.0,,5.0
b,3.0,3.0,,3.0
c,3.0,2.0,,2.0
0,,,1.0,
1,,,2.0,
2,,,3.0,
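
Two `pd.concat` options help with mismatched frames: `join='inner'` keeps only the shared columns, and `ignore_index=True` builds a fresh index. The sketch below uses made-up frames `top` and `bottom`:

```python
import pandas as pd

top = pd.DataFrame({'numeric': [5, 3], 'C': [5, 3]})
bottom = pd.DataFrame({'numeric': [1, 2], 'col': [7, 8]})

padded = pd.concat([top, bottom])                          # union of columns, NaN where missing
shared = pd.concat([top, bottom], join='inner')            # only columns present in both
renumbered = pd.concat([top, bottom], ignore_index=True)   # fresh 0..n-1 index
```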


## Exercise 2
### Combining numpy and pandas

1) Create 2 arrays of integers

One should be created using np.random

In [109]:
# note: rand returns floats in [0, 1); np.random.randint(0, 10, (2,2)) would give integers
max1 = np.random.rand(2,2)

max1 

array([[ 0.3954838 ,  0.34404913],
       [ 0.67605838,  0.73840937]])

In [110]:
max2 = np.random.rand(2,2)

max2

array([[ 0.8385117 ,  0.1609604 ],
       [ 0.44901179,  0.77349078]])

2) Turn those arrays into pandas DataFrames

The columns can be named numerically

In [112]:
maxDat1 = pd.DataFrame(max1, columns=['col_a', 'col_b'])
maxDat1


Unnamed: 0,col_a,col_b
0,0.395484,0.344049
1,0.676058,0.738409


In [113]:
maxDat2 = pd.DataFrame(max2, columns=['col_a', 'col_b'])
maxDat2

Unnamed: 0,col_a,col_b
0,0.838512,0.16096
1,0.449012,0.773491


3) Use some of the summary functions on the dataframes and arrays

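Show how mean and var compare between pandas and numpy (pandas' var defaults to the sample variance, ddof=1, while numpy's defaults to ddof=0, so the two var results differ unless ddof is set explicitly)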

In [115]:
print maxDat1.col_a.mean()
print maxDat1.col_a.var()

0.535771088151
0.0393610462623



4) Add an extra index using .loc

In [125]:
maxDat1['D'] = maxDat1.loc[:,'col_b']
maxDat1

Unnamed: 0,col_a,col_b,D
0,0.395484,0.344049,0.344049
1,0.676058,0.738409,0.738409


5) Using merge or join, create a single DataFrame from the two

In [126]:
# make sure the column names are different in the data frames. It will throw an error otherwise
# for join or merge the data frames need not be of the same size
maxDat1.join(maxDat2, rsuffix='_2')

Unnamed: 0,col_a,col_b,D,col_a_2,col_b_2
0,0.395484,0.344049,0.344049,0.838512,0.16096
1,0.676058,0.738409,0.738409,0.449012,0.773491


6) Try testing out the groupby functions

df.groupby(column).agg (agg can be an aggregate function, try sum, max, min...)

Resources can be found here: http://pandas.pydata.org/pandas-docs/stable/10min.html#grouping
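
As a starting point for this step, here is a small sketch of `groupby` with a single aggregate and with a per-column `agg`. The frame `clicks` and its values are made up; the column names mirror the nytimes data used earlier.

```python
import pandas as pd

clicks = pd.DataFrame({
    'age_categories': ['1', '1', '2', '2', '2'],
    'Impressions':    [3, 5, 4, 6, 2],
    'Clicks':         [0, 1, 0, 2, 1],
})

# one aggregate applied to every remaining column
totals = clicks.groupby('age_categories').sum()

# a different aggregate per column, given as a dict of column -> function name
summary = clicks.groupby('age_categories').agg({'Impressions': 'mean', 'Clicks': 'max'})
```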


## Plotting!

In [None]:
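# pandas.io.data was removed in pandas 0.19; newer setups use the separate pandas-datareader package
import pandas.io.data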
import datetime
import matplotlib.pyplot as plt

%matplotlib inline

mu, sigma = 0, 0.1
normal_dist = np.random.normal(mu, sigma, 1000)
# note: the ticker requested is 'FB', although the variable is named aapl
aapl = pd.io.data.get_data_yahoo('FB', 
                                 start=datetime.datetime(2015, 4, 1), 
                                 end=datetime.datetime(2015, 4, 28))
aapl.head()
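
The download above depends on a network service and on the installed pandas version, so an offline stand-in can be handy. The sketch below builds a synthetic frame with the same column names. Every value is randomly generated, not real market data, and the name `fake_prices`, the seed and the starting level of 80 are all made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
dates = pd.date_range('2015-04-01', '2015-04-28', freq='B')   # business days only

close = 80 + rng.normal(0, 1, len(dates)).cumsum()            # random walk starting near 80
open_ = close + rng.normal(0, 0.5, len(dates))                # open scattered around close

fake_prices = pd.DataFrame({
    'Open': open_,
    'Close': close,
    'High': np.maximum(open_, close) + 0.5,   # always at or above open and close
    'Low': np.minimum(open_, close) - 0.5,    # always at or below open and close
}, index=dates)
```

Assigning `aapl = fake_prices` should let the plotting cells that follow run without the download.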

## Matplotlib

Matplotlib is the standard, low-level library for building visualizations in Python. Although tried and true, it can be cumbersome compared to higher-level packages such as Seaborn or Bokeh. Note that several visualization packages, including Seaborn and pandas' own plotting, use matplotlib as their base; Bokeh does not.

In [None]:
fig = plt.figure(figsize=(20,16))

ax = fig.add_subplot(2,2,1)
ax.plot(aapl.index, aapl['Close'])
ax.set_title('Line plots', size=24)

ax = fig.add_subplot(2,2,2)
ax.plot(aapl['Close'], 'o')
ax.set_title('Scatter plots', size=24)

ax = fig.add_subplot(2,2,3)
ax.hist(normal_dist, bins=50)
ax.set_title('Histograms', size=24)
ax.set_ylabel('count', size=16)

ax = fig.add_subplot(2,2,4)
ax.boxplot(normal_dist)
ax.set_title('Boxplots', size=24)

### Bokeh
To install Bokeh, go to a terminal and type:

`conda install bokeh` 

Bokeh is built by the same people that created Anaconda (Continuum Analytics) and is designed out of the box for web display, making it nice for quickly creating presentation-ready, interactive visuals. Labs in this course will be shown in Bokeh. Check out http://bokeh.pydata.org/en/latest/docs/quickstart.html#concepts to see some of the range of capabilities.

In [None]:
from bokeh.plotting import figure, output_notebook,show
output_notebook()

In [None]:
# prepare some data
x = aapl.Low
y = aapl['High']

# create a new plot with a title and axis labels
p = figure(title="Stock High vs. Low", x_axis_label='Low', y_axis_label='High')

# add a circle renderer with a legend entry and outline thickness
p.circle(x, y, legend="High vs. Low", line_width=2)

# show the results
show(p)

In [None]:
x = aapl.index
y = aapl.Close
p = figure(title="Stock Open & Close over time", x_axis_label='Date', y_axis_label='Price',x_axis_type="datetime")
# Note that I've declared the x_axis_type
p.square(x, y, legend="Close")
p.circle(x,aapl.Open,legend='Open',color='red')
# show the results
show(p)

## Pandas Plotting!

The plot method is a great, quick way to visualize your dataframes. By selecting the columns you care to view, calling .plot() on the dataframe defaults to a line chart vs. the index.

We will be revisiting this so just take a second to appreciate what can be done with one line of code.

In [None]:
aapl[['Open','Close']].plot()

In [None]:
aapl[['High','Low','Open','Close']].plot(kind='box')