# Linear Algebra

#### Importing Packages
Python has an extensive collection of third-party libraries, or ***packages***, with additional functions, data-structures, etc.  Many (most?) packages of interest are hosted on the Python Package Index ***pypi***, and can be installed into your environment using ***pip***.  The Anaconda distribution in the pre-work includes a number of these that are useful in data science, so you should have most of them installed already.  

ref:  https://pypi.python.org/pypi

A note on namespaces when importing - there are a few different ways to import:

    1. import numpy

    - everything numpy exports is accessible through the numpy namespace
    - e.g. numpy.array([1,2,3])


    2. import numpy as np

    - same as 1 except an alias 'np' is created for the namespace instead
    - e.g. np.array([1,2,3])


    3. from numpy import *

    - adds all of numpy's public names to the global namespace
    - e.g. array([1,2,3])
    - Note: This can be dangerous: if two modules export the same name, whichever is imported last silently overwrites the earlier one (a naming collision).
    
    4. from numpy import array
    
    - imports only the named objects into the global namespace
    - e.g. array([1,2,3])
    - Note: can be ok since you are being explicit
    
    We will generally use 2 and 4 (sparingly)
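
To make the naming collision from option 3 concrete, here is a minimal sketch using `sqrt`, a name that both the standard-library `math` module and numpy export. It uses explicit imports rather than `import *` so the effect is easy to see; the names `first_sqrt` and `second_sqrt` are only illustrative.

```python
import math
import numpy as np

from math import sqrt      # sqrt now refers to math.sqrt
first_sqrt = sqrt

from numpy import sqrt     # same name: silently replaces math.sqrt
second_sqrt = sqrt

print(first_sqrt is math.sqrt)   # True
print(second_sqrt is np.sqrt)    # True: the later import won

# math.sqrt only accepts scalars while np.sqrt also accepts arrays,
# so the two are not interchangeable even though they share a name.
print(second_sqrt(np.array([1.0, 4.0, 9.0])))
```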

### NumPy

As we've seen in lecture, linear algebra is the branch of mathematics that studies vectors, vector spaces, and the linear maps between them. Many data science techniques rely on it (e.g. Linear Regression, Support Vector Machines, Recommender Systems, Neural Networks).

NumPy is a package for scientific computing, built around creating and manipulating N-dimensional array objects.

In [2]:
import numpy

import numpy as np

from numpy import absolute

# The next one is dangerous to do, and not recommended 
# except in cases where you know why you're using it
# from numpy import *

Now we can do the same thing three ways

In [3]:
print numpy.absolute(-10) ## absolute is a function (a NumPy ufunc), not a method
print np.absolute(-10)
print absolute(-10)

10
10
10


We can create the Linear Algebra Objects we saw in lecture

In [5]:
vector = np.array([1, 2, 1])
vector 

array([1, 2, 1])

In [6]:
data = np.array([[1, 2, 3],[2, 4, 9]])
data

array([[1, 2, 3],
       [2, 4, 9]])

In [19]:
v = np.array([[1],[2],[1]])
v

array([[1],
       [2],
       [1]])

In [7]:
data[0]  # first row

array([1, 2, 3])

In [15]:
data[ : , 1]  # all rows, second column (a bare colon selects everything along that axis)

array([2, 4])

Numpy lets us perform matrix operations

In [16]:
np.dot(data, vector)

array([ 8, 19])
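
To see where the two numbers above come from, the sketch below recomputes the product by hand: each output entry is one row of `data` multiplied element-wise by `vector` and then summed. The name `manual` is only illustrative.

```python
import numpy as np

data = np.array([[1, 2, 3], [2, 4, 9]])
vector = np.array([1, 2, 1])

# Row 0: 1*1 + 2*2 + 3*1 = 8;  Row 1: 2*1 + 4*2 + 9*1 = 19
manual = np.array([(row * vector).sum() for row in data])
product = np.dot(data, vector)

print(manual)                            # [ 8 19]
print(np.array_equal(manual, product))   # True
```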

We can transpose an array

In [20]:
print data
print data.T

[[1 2 3]
 [2 4 9]]
[[1 2]
 [2 4]
 [3 9]]


In [23]:
np.arange(25)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])

#### Creating a square matrix array

In [22]:
a = np.arange(25).reshape(5,5)
# arange(n) is a function that creates a 1-D array of the integers 0 to n-1
# reshape(M,N) is a method that returns the same values arranged as an MxN array
a

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

In [24]:
biga = a*10
biga

array([[  0,  10,  20,  30,  40],
       [ 50,  60,  70,  80,  90],
       [100, 110, 120, 130, 140],
       [150, 160, 170, 180, 190],
       [200, 210, 220, 230, 240]])

In [25]:
print biga.mean()
print biga.mean(0) #Average per column
biga.mean(1) #average per row
# type(biga.mean(1))

120.0
[ 100.  110.  120.  130.  140.]


array([  20.,   70.,  120.,  170.,  220.])

In [29]:
biga.mean(1)

array([  20.,   70.,  120.,  170.,  220.])
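
The positional argument to `mean` is the axis that gets collapsed. A short sketch that spells this out with the `axis` keyword, rebuilding the same `biga` array (the names `col_means` and `row_means` are illustrative):

```python
import numpy as np

biga = np.arange(25).reshape(5, 5) * 10

col_means = biga.mean(axis=0)   # collapse the rows: one mean per column
row_means = biga.mean(axis=1)   # collapse the columns: one mean per row

print(col_means)
print(row_means)
```

Writing `axis=0` or `axis=1` explicitly tends to be easier to read later than a bare `0` or `1`.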

Creating a matrix with numpy

In [30]:
bigm = np.matrix(biga-20)
bigm

matrix([[-20, -10,   0,  10,  20],
        [ 30,  40,  50,  60,  70],
        [ 80,  90, 100, 110, 120],
        [130, 140, 150, 160, 170],
        [180, 190, 200, 210, 220]])

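Creating the Inverse of a Matrix

Note that `biga-20` is singular (its rows are linearly dependent, so its rank is 2) and therefore has no true inverse. The values of order 1e13 below are floating-point artifacts, not a meaningful result.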

In [31]:
np.linalg.inv(biga-20)

array([[ -2.81474977e+13,   2.22222222e-02,   5.62949953e+13,
          0.00000000e+00,  -2.81474977e+13],
       [  3.51843721e+13,   0.00000000e+00,  -5.27765581e+13,
         -3.51843721e+13,   5.27765581e+13],
       [ -4.22212465e+13,   9.38249922e+13,  -7.97512434e+13,
          4.69124961e+13,  -1.87649984e+13],
       [  9.14793674e+13,  -1.87649984e+14,   9.26521798e+13,
          1.17281240e+13,  -8.20968682e+12],
       [ -5.62949953e+13,   9.38249922e+13,  -1.64193736e+13,
         -2.34562481e+13,   2.34562481e+12]])
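
One way to check whether an inverse like the one above can be trusted is to look at the rank of the matrix first. The sketch below is a minimal illustration, not part of the original lecture: it rebuilds the same values as `biga-20`, and the small matrix `A` is made up to show what a well-behaved inverse looks like.

```python
import numpy as np

singular = np.arange(25).reshape(5, 5) * 10 - 20   # same values as biga-20
rank = np.linalg.matrix_rank(singular)
print(rank)   # 2, but a 5x5 matrix needs rank 5 to be invertible

# A small invertible matrix (illustrative values)
A = np.array([[2.0, 1.0], [1.0, 3.0]])
A_inv = np.linalg.inv(A)

# A matrix times its inverse should give the identity, up to rounding error
print(np.allclose(np.dot(A, A_inv), np.eye(2)))   # True
```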

#### Slices

In [32]:
type(bigm)

numpy.matrixlib.defmatrix.matrix

In [33]:
bigm = np.array(bigm)
print bigm
bigm[0]

[[-20 -10   0  10  20]
 [ 30  40  50  60  70]
 [ 80  90 100 110 120]
 [130 140 150 160 170]
 [180 190 200 210 220]]


array([-20, -10,   0,  10,  20])

In [34]:
#Same thing, but demonstrating the full slice with a colon
print biga
biga[0,:]

[[  0  10  20  30  40]
 [ 50  60  70  80  90]
 [100 110 120 130 140]
 [150 160 170 180 190]
 [200 210 220 230 240]]


array([ 0, 10, 20, 30, 40])

In [39]:
biga[:,2]

array([ 20,  70, 120, 170, 220])

#### Describing your Arrays

In [40]:
compa = np.arange(30).reshape(5,3,2)
compa

array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]],

       [[12, 13],
        [14, 15],
        [16, 17]],

       [[18, 19],
        [20, 21],
        [22, 23]],

       [[24, 25],
        [26, 27],
        [28, 29]]])

In [41]:
# let's describe it: shape, number of dimensions, and element type
print compa.shape
print compa.ndim
print compa.dtype

(5, 3, 2)
3
int64


In [42]:
compa[3,:,1]

array([19, 21, 23])

In [44]:
# We can assign values using list-like indexing
# But be careful with types: compa holds integers, so 5.9 is truncated to 5
compa[0,0,0] = 5.9
compa[0,0,0]

5

We can change the datatype when needed

In [45]:
compa = compa.astype(float)
compa[0,0,0] = 5.75
compa[0,0,0]

5.75

#### Stacking arrays

Arrays can only be stacked when they match in size along every dimension other than the one being stacked.

In [47]:
a = np.array((1,2,3))
b = np.array((2,3,4))
print 'H Stack'
print np.hstack((a,b))
print 'V Stack'
print np.vstack((a,b))

H Stack
[1 2 3 2 3 4]
V Stack
[[1 2 3]
 [2 3 4]]


In [48]:
a = np.array([[1],[2],[3]])
b = np.array([[2],[3],[4]])
print 'H Stack'
print np.hstack((a,b))
print 'V Stack'
print np.vstack((a,b))

H Stack
[[1 2]
 [2 3]
 [3 4]]
V Stack
[[1]
 [2]
 [3]
 [2]
 [3]
 [4]]
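
The size requirement for stacking is easiest to see when it fails. In this sketch the shorter array `c` is hypothetical: `hstack` joins 1-D arrays end to end, so their lengths may differ, while `vstack` treats each 1-D array as a row and raises a `ValueError` when the row lengths disagree.

```python
import numpy as np

a = np.array([1, 2, 3])
c = np.array([4, 5])          # hypothetical shorter array

# hstack joins 1-D arrays end to end, so the lengths may differ
joined = np.hstack((a, c))
print(joined)                 # [1 2 3 4 5]

# vstack treats each 1-D array as a row, so the row lengths must match
try:
    np.vstack((a, c))
    vstack_failed = False
except ValueError:
    vstack_failed = True
print(vstack_failed)          # True
```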


### Using Random Numbers

Random numbers are very helpful and are necessary at times for testing data pipelines and running statistical analyses. Functions for creating random values are under numpy.random.

In [49]:
# Create a 5x5 array of samples drawn uniformly from [0, 1)
rm = np.random.rand(5,5)
rm

array([[ 0.39061437,  0.23591475,  0.05679977,  0.20119137,  0.57851436],
       [ 0.88754211,  0.49599677,  0.46333454,  0.5909792 ,  0.95600652],
       [ 0.906349  ,  0.05369232,  0.9509672 ,  0.25425329,  0.67763412],
       [ 0.92993601,  0.19299255,  0.16338799,  0.78138755,  0.19861134],
       [ 0.74372619,  0.06216616,  0.17597336,  0.20912107,  0.47518434]])

In [50]:
rm.shape

(5, 5)

You can shuffle values randomly as well

In [51]:
# This will shuffle along the first index of a multi-dimensional array
np.random.shuffle(rm)
rm

array([[ 0.92993601,  0.19299255,  0.16338799,  0.78138755,  0.19861134],
       [ 0.88754211,  0.49599677,  0.46333454,  0.5909792 ,  0.95600652],
       [ 0.906349  ,  0.05369232,  0.9509672 ,  0.25425329,  0.67763412],
       [ 0.39061437,  0.23591475,  0.05679977,  0.20119137,  0.57851436],
       [ 0.74372619,  0.06216616,  0.17597336,  0.20912107,  0.47518434]])

In [52]:
print rm.mean()
print rm.mean(0) #Average per column
print rm.mean(1) #average per row

0.465291049276
[ 0.77163354  0.20815251  0.36209257  0.40738649  0.57719014]
[ 0.45326309  0.67877183  0.56857918  0.29260692  0.33323423]


In [53]:
# rand samples a uniform distribution; for a normal distribution, use np.random.normal
rm = np.random.normal(5,9,(30,30)) # mean 5, stdev 9, shape 30x30 (shift+tab in Jupyter shows the signature)
rm

array([[  3.51989643,  17.9537582 ,   7.8046342 ,  -6.11486712,
          2.0797233 ,   3.07614732,   4.20661469,   7.48296273,
          3.16099674,  -0.6479346 ,  10.97704437,   6.83587785,
          6.50532395,  -4.30138935,  14.44216677,  -6.41755407,
         -2.01697277,   9.36841953,  30.35378261,   3.00832808,
         13.01566778,   5.65678521, -11.05000728,   0.78576317,
          0.58624607,  13.16055363,   0.06342776, -19.27873593,
          4.0881701 ,   0.55266206],
       [  0.61662899,  -6.86934124,  11.20793801,   9.70341177,
         -7.14535816,   3.6239116 ,  -2.35356859,  -4.24953058,
          7.49802508,   3.2163933 ,  -1.09328063,  -1.36393225,
          2.55367756,   4.75822448,   7.25577113,  14.60085747,
         17.70541473,   4.60433785,   9.85439393,  16.23116938,
          9.79818441,   0.84167106,  10.78112979,  -3.34739437,
         10.3897669 , -21.31229323,  -1.73180367,   2.13663391,
         -7.11044053,   3.48084037],
       [  4.6084203 ,  15.1822

In [54]:
print rm.mean(), "which is hopefully close to the input mean of 5"
print rm.var(), "which is the variance = stdev squared, so hopefully close to 81"
print np.median(rm)

4.92267602467 which is hopefully close to the input mean of 5
74.6004305611 which is the variance = stdev squared, so hopefully close to 81
5.61417622161
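
Because every run draws new numbers, results like the ones above change each time the cell is executed. When testing a pipeline it often helps to fix the seed. A minimal sketch, assuming `np.random.RandomState` as the seeded generator (the seed 42 is arbitrary):

```python
import numpy as np

rng_a = np.random.RandomState(42)   # 42 is an arbitrary seed
rng_b = np.random.RandomState(42)

sample_a = rng_a.normal(5, 9, (30, 30))
sample_b = rng_b.normal(5, 9, (30, 30))

# Same seed, same sequence of draws
print(np.array_equal(sample_a, sample_b))   # True

# Sample statistics land near, but not exactly on, the inputs
print(sample_a.mean())   # near 5
print(sample_a.std())    # near 9
```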


Find more distributions and random functions here: http://docs.scipy.org/doc/numpy/reference/routines.random.html

### Exercise 1

1) Create a 4x5 array of integers between 0 and 19

In [60]:
a = np.arange(20).reshape(4,5)
print a
print a.shape

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]
(4, 5)


2) Create a 50x500 array with a mean of 20 and variance of 100. Save it to a variable called  biggie

In [70]:
biggie = np.random.normal(20,10,(50,500))

In [71]:
print biggie.mean()
print biggie.var()
print biggie.std()

20.0334351493
100.833731456
10.0416000446


3) Change the mean of the array to a value within 1 of 0 and the variance within 1 of 25. Think about what the mean and the variance represent and try using various mathematical operations.

In [92]:
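# Subtracting 20 shifts the mean from about 20 to about 0; the variance is unchanged
biggie = biggie-20
# Dividing by 2 halves the stdev (10 -> 5), so the variance becomes about 25
# Caution: this cell overwrites biggie, so each re-run shifts and shrinks it again;
# the output recorded below is not what a single run produces
biggie = biggie/2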

print biggie.mean()
print biggie.var()

-18.7479103032
0.3938817635
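
The reasoning behind step 3 is that subtracting a constant moves the mean without touching the variance, while dividing by k divides the standard deviation by k and the variance by k squared. A repeatable sketch under stated assumptions: the seed is arbitrary, and the result is written to a new name (`rescaled`, illustrative) so that re-running the cell does not transform the data twice.

```python
import numpy as np

rng = np.random.RandomState(0)              # arbitrary seed for repeatability
biggie = rng.normal(20, 10, (50, 500))      # mean 20, stdev 10 -> variance 100

# Shift by the mean, then scale: stdev 10 / 2 = 5, so variance 5**2 = 25
rescaled = (biggie - 20) / 2.0

print(rescaled.mean())   # close to 0
print(rescaled.var())    # close to 25
```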


## Pandas (python package)

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.

Pandas is great for tabular/indexed data

In [1]:
# NOTE: you should normally put all your imports at the top of the file
import pandas as pd

In [2]:
data = pd.read_csv('../data/nytimes.csv')

In [3]:
# Note here we're calling the head method on the dataframe to return the 'head' of the 
# dataframe, in this case the first 4 rows
# head() returns a new DataFrame object rather than changing data itself; this is important later in the course!
data.head(4)

Unnamed: 0,Age,Gender,Impressions,Clicks,Signed_In
0,36,0,3,0,1
1,73,1,3,0,1
2,30,0,3,0,1
3,49,1,3,0,1


In [4]:
data[0:4] ## slicing with [] selects rows by position, so this matches head(4)

Unnamed: 0,Age,Gender,Impressions,Clicks,Signed_In
0,36,0,3,0,1
1,73,1,3,0,1
2,30,0,3,0,1
3,49,1,3,0,1


In [5]:
# Each DataFrame has an index
# Sometimes you will need to reindex
data.index

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')

In [6]:
data.tail(4)

Unnamed: 0,Age,Gender,Impressions,Clicks,Signed_In
458437,0,0,4,0,0
458438,72,1,5,0,1
458439,0,0,5,0,0
458440,0,0,3,0,0


In [7]:
# This is a Series
# A DataFrame is made up of several Series that share the same index
data.Age

0     36
1     73
2     30
3     49
4     47
5     47
6      0
7     46
8     16
9     52
10     0
11    21
12     0
13    57
14    31
...
458426    49
458427    43
458428    40
458429    49
458430     0
458431    21
458432    30
458433    21
458434    61
458435    51
458436     0
458437     0
458438    72
458439     0
458440     0
Name: Age, Length: 458441, dtype: int64

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 458441 entries, 0 to 458440
Data columns (total 5 columns):
Age            458441 non-null int64
Gender         458441 non-null int64
Impressions    458441 non-null int64
Clicks         458441 non-null int64
Signed_In      458441 non-null int64
dtypes: int64(5)
memory usage: 21.0 MB


In [9]:
data.describe()

Unnamed: 0,Age,Gender,Impressions,Clicks,Signed_In
count,458441.0,458441.0,458441.0,458441.0,458441.0
mean,29.482551,0.367037,5.007316,0.092594,0.70093
std,23.607034,0.481997,2.239349,0.309973,0.457851
min,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,3.0,0.0,0.0
50%,31.0,0.0,5.0,0.0,1.0
75%,48.0,1.0,6.0,0.0,1.0
max,108.0,1.0,20.0,4.0,1.0


In [10]:
data.Age.values

array([36, 73, 30, ..., 72,  0,  0])

In [11]:
# .values exposes the underlying data as a numpy array
type(data.Age.values)

numpy.ndarray

#### Just like in numpy, we can use mean, var, and other functions on the data

In [12]:
print data.Age.mean()
print data.Age.var()
print data.Age.max()
print data.Age.min()

29.4825506445
557.292044027
108
0


In [13]:
# Function that groups users by age. Super useful in data cleaning and for bucketing a continuous
# variable into a few categories that are easier to deal with
def map_age_category(x):
    if x < 18:
        return '1'
    elif x < 25:
        return '2'
    elif x < 32:
        return '3'
    elif x < 45:
        return '4'
    else:
        return '5'

data['age_categories'] = data['Age'].apply(map_age_category) ## for each value in Age, apply the function

In [14]:
data.head()

Unnamed: 0,Age,Gender,Impressions,Clicks,Signed_In,age_categories
0,36,0,3,0,1,4
1,73,1,3,0,1,5
2,30,0,3,0,1,3
3,49,1,3,0,1,5
4,47,1,11,0,1,5
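
For bucketing like this, pandas also offers `pd.cut`, which bins a whole column in one call. The sketch below is an alternative rather than the approach used above: the sample ages are made up, and the bin edges are chosen to mirror the thresholds in `map_age_category`.

```python
import numpy as np
import pandas as pd

ages = pd.Series([36, 73, 30, 49, 0, 17, 18, 44, 45])   # made-up sample ages

# right=False makes each bin include its left edge and exclude its right
# edge, e.g. [18, 25), matching the "x < 25" style tests in the function.
edges = [-np.inf, 18, 25, 32, 45, np.inf]
labels = ['1', '2', '3', '4', '5']
categories = pd.cut(ages, bins=edges, right=False, labels=labels)

print(categories.astype(str).tolist())
```

The first four ages are the first four rows of the dataset shown earlier, so their categories can be compared against the `age_categories` column above.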


#### Sorting data

In [15]:
data.sort_index(axis=1, ascending=False).head() # axis=1 sorts the column labels; axis=0 (the default) sorts the row index

Unnamed: 0,age_categories,Signed_In,Impressions,Gender,Clicks,Age
0,4,1,3,0,0,36
1,5,1,3,1,0,73
2,3,1,3,0,0,30
3,5,1,3,1,0,49
4,5,1,11,1,0,47


In [16]:
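# DataFrame.sort was replaced by sort_values in later pandas releases: data.sort_values('Signed_In')
data.sort('Signed_In').head()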

Unnamed: 0,Age,Gender,Impressions,Clicks,Signed_In,age_categories
458440,0,0,3,0,0,1
149914,0,0,5,0,0,1
149915,0,0,10,1,0,1
149916,0,0,6,1,0,1
353762,0,0,3,0,0,1
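
Sorting by column values can also use several columns at once. A small sketch with made-up rows, assuming a pandas version that provides `sort_values`:

```python
import pandas as pd

people = pd.DataFrame({'Age': [36, 73, 30, 49],
                       'Signed_In': [1, 0, 1, 0]})   # made-up rows

# Sort by Signed_In ascending, then break ties by Age descending
by_signin = people.sort_values(['Signed_In', 'Age'], ascending=[True, False])
print(by_signin)

# The original index labels travel with their rows
print(by_signin.index.tolist())   # [1, 3, 0, 2]
```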


In [17]:
ran_data = [
    ['a', 1]
    , ['b', 2]
    , ['c', 3]
]
df = pd.DataFrame(ran_data, columns=['col_a', 'numeric'])
df

Unnamed: 0,col_a,numeric
0,a,1
1,b,2
2,c,3


#### Indexing functions

Pandas Dataframes support various methods for indexing:

- .iloc (select by integer position)
- .loc (select by label)
- .ix (mixed label and position lookup)

In [18]:
df

Unnamed: 0,col_a,numeric
0,a,1
1,b,2
2,c,3


In [19]:
# iloc accesses a row by its integer position (0 is the first row)
df.iloc[0]

col_a      a
numeric    1
Name: 0, dtype: object

In [123]:
df.set_index('col_a') ## returns a new DataFrame indexed by col_a; without inplace=True, df itself is left unchanged

Unnamed: 0_level_0,numeric
col_a,Unnamed: 1_level_1
a,1
b,2
c,3


In [20]:
# loc accesses a dataframe row by its index label (or column label)
# Assigning to a label that does not exist yet ('a', 'b') appends a new row filled with that value
df.loc['a'] = 5
df.loc['b'] = 3

In [21]:
df

Unnamed: 0,col_a,numeric
0,a,1
1,b,2
2,c,3
a,5,5
b,3,3


In [22]:
# This can be used to add new columns
df.loc[:,'C'] = df.loc[:,'numeric']

df

Unnamed: 0,col_a,numeric,C
0,a,1,1
1,b,2,2
2,c,3,3
a,5,5,5
b,3,3,3


In [23]:
# is equivalent to:
df['D'] = df.loc[:,'numeric']

df

Unnamed: 0,col_a,numeric,C,D
0,a,1,1,1
1,b,2,2,2
2,c,3,3,3
a,5,5,5,5
b,3,3,3,3


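.ix is the generic indexer: it looks up labels first and falls back to integer positions. It was deprecated and later removed from pandas, so new code should use .loc or .iloc instead.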

Values can be set by index and index + column

In [24]:
df.ix[2] = 2
df.ix[2, 'C'] = 3

df

Unnamed: 0,col_a,numeric,C,D
0,a,1,1,1
1,b,2,2,2
2,2,2,3,2
a,5,5,5,5
b,3,3,3,3


In [25]:
df

Unnamed: 0,col_a,numeric,C,D
0,a,1,1,1
1,b,2,2,2
2,2,2,3,2
a,5,5,5,5
b,3,3,3,3
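
The difference between label-based and position-based indexing is clearest on a frame whose index is not 0, 1, 2. A minimal sketch on toy data (the frame and its values are made up):

```python
import pandas as pd

frame = pd.DataFrame({'numeric': [1, 2, 3]}, index=['a', 'b', 'c'])   # toy data

print(frame.loc['b', 'numeric'])    # label lookup -> 2
print(frame.iloc[1, 0])             # position lookup, same cell -> 2

# Label slices include the end label; position slices exclude the end position
print(frame.loc['a':'b'].shape)     # (2, 1)
print(frame.iloc[0:1].shape)        # (1, 1)
```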


### Combining DataFrames
#### Appending
We can append dataframes together

In [26]:
df_combine = df.append(df)
df_combine

Unnamed: 0,col_a,numeric,C,D
0,a,1,1,1
1,b,2,2,2
2,2,2,3,2
a,5,5,5,5
b,3,3,3,3
0,a,1,1,1
1,b,2,2,2
2,2,2,3,2
a,5,5,5,5
b,3,3,3,3


In [27]:
# When DataFrames are appended together, we often need to create a new index
df_combine.reset_index()

Unnamed: 0,index,col_a,numeric,C,D
0,0,a,1,1,1
1,1,b,2,2,2
2,2,2,2,3,2
3,a,5,5,5,5
4,b,3,3,3,3
5,0,a,1,1,1
6,1,b,2,2,2
7,2,2,2,3,2
8,a,5,5,5,5
9,b,3,3,3,3


#### Join lets us join together dataframes using their index

In [28]:
df_2 = pd.DataFrame([1, 2, 3], columns=['col'])
df_2

Unnamed: 0,col
0,1
1,2
2,3


In [29]:
# The default is a left join, so every row of df is kept and NaN is placed
# where df_2 has no matching index
df.join(df_2)


Unnamed: 0,col_a,numeric,C,D,col
0,a,1,1,1,1.0
1,b,2,2,2,2.0
2,2,2,3,2,3.0
a,5,5,5,5,
b,3,3,3,3,


In [31]:
# reset the index
df_1 = df.reset_index()
df_1

Unnamed: 0,index,col_a,numeric,C,D
0,0,a,1,1,1
1,1,b,2,2,2
2,2,2,2,3,2
3,a,5,5,5,5
4,b,3,3,3,3


In [32]:
# try joining again:
df_1.join(df_2)


Unnamed: 0,index,col_a,numeric,C,D,col
0,0,a,1,1,1,1.0
1,1,b,2,2,2,2.0
2,2,2,2,3,2,3.0
3,a,5,5,5,5,
4,b,3,3,3,3,


#### Merge allows us to join on any fields

In [33]:
# Merge has a default of inner join
# So rows without a match in both DataFrames are omitted
df.merge(df_2, left_on='numeric', right_on='col')

Unnamed: 0,col_a,numeric,C,D,col
0,a,1,1,1,1
1,b,2,2,2,2
2,2,2,3,2,2
3,3,3,3,3,3
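
The `how` argument controls which unmatched rows survive a merge. A short sketch comparing the join types on two made-up frames that share a `key` column:

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 4], 'left_val': ['a', 'b', 'd']})    # toy data
right = pd.DataFrame({'key': [1, 2, 3], 'right_val': ['x', 'y', 'z']})

inner = left.merge(right, on='key')                  # default how='inner'
left_join = left.merge(right, on='key', how='left')  # keep every row of left
outer = left.merge(right, on='key', how='outer')     # keep keys from both sides

print(len(inner))       # 2: only keys 1 and 2 appear on both sides
print(len(left_join))   # 3: key 4 is kept with NaN for right_val
print(len(outer))       # 4: keys 1, 2, 3 and 4
```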


#### Concat combines a list of DataFrames together

In [34]:
# It can be used like append (newer pandas versions removed DataFrame.append in favor of concat)
pd.concat([df, df])

Unnamed: 0,col_a,numeric,C,D
0,a,1,1,1
1,b,2,2,2
2,2,2,3,2
a,5,5,5,5
b,3,3,3,3
0,a,1,1,1
1,b,2,2,2
2,2,2,3,2
a,5,5,5,5
b,3,3,3,3


In [35]:
# But concat will create a sparse DataFrame when columns don't match
# This can create huge dataframes when mismatches occur
pd.concat([df, df_2]) ## columns missing from one input are filled with NaN

Unnamed: 0,C,D,col,col_a,numeric
0,1.0,1.0,,a,1.0
1,2.0,2.0,,b,2.0
2,3.0,2.0,,2,2.0
a,5.0,5.0,,5,5.0
b,3.0,3.0,,3,3.0
0,,,1.0,,
1,,,2.0,,
2,,,3.0,,


## Exercise 2
### Combining numpy and pandas

1) Create 2 arrays of integers

One should be created using np.random

In [70]:
import numpy as np
a = np.array([[1,2],[5,6]])
b = np.random.rand(2,2) # note: rand returns floats in [0, 1); np.random.randint(0, 10, (2,2)) would give integers
print a
print b

[[1 2]
 [5 6]]
[[ 0.94382959  0.4027728 ]
 [ 0.45944587  0.58298883]]


2) Turn those arrays into pandas DataFrames

The columns can be named numerically

In [71]:
new_df1 = pd.DataFrame(a, columns=['col_a', 'col_b'])
new_df1

Unnamed: 0,col_a,col_b
0,1,2
1,5,6


In [72]:
new_df2 = pd.DataFrame(b,columns=['col_1','col_2'])
new_df2

Unnamed: 0,col_1,col_2
0,0.94383,0.402773
1,0.459446,0.582989


In [80]:
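# Adds a row labeled 2. The assigned Series is aligned on column labels, and col_1/col_2
# do not match col_a/col_b, so the new row is all NaN; new_df2.loc[1,:].values would copy by position
new_df1.loc[2,:] = new_df2.loc[1,:]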
new_df1

Unnamed: 0,col_a,col_b
0,1.0,2.0
1,5.0,6.0
2,,


3) Use some of the summary functions on the dataframes and arrays

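Show how mean and var compare between pandas and numpy. The means agree; the variances agree only when the same ddof is used, because pandas defaults to the sample variance (ddof=1) and numpy to the population variance (ddof=0).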

In [74]:
new_df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
col_a    2 non-null int64
col_b    2 non-null int64
dtypes: int64(2)
memory usage: 48.0 bytes


In [75]:
new_df1.describe()

Unnamed: 0,col_a,col_b
count,2.0,2.0
mean,3.0,4.0
std,2.828427,2.828427
min,1.0,2.0
25%,2.0,3.0
50%,3.0,4.0
75%,4.0,5.0
max,5.0,6.0
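
The std of 2.828427 reported above is the sample standard deviation, which is what pandas computes by default; numpy's default differs. A small sketch using the two values of `col_a` (the names `values` and `series` are illustrative):

```python
import numpy as np
import pandas as pd

values = np.array([1, 5])            # col_a from the exercise above
series = pd.Series(values)

print(series.mean() == values.mean())   # True, both are 3.0
print(series.var())                     # 8.0, sample variance (ddof=1)
print(values.var())                     # 4.0, population variance (ddof=0)
print(values.var(ddof=1))               # 8.0, matches pandas
```

On a large dataset the two conventions give nearly identical numbers, but on small samples like this one the gap is large.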



4) Add an extra row (a new index label) using .loc

In [78]:
new_df1

Unnamed: 0,col_a,col_b
0,1,2
1,5,6


5) Using merge or join, create a single DataFrame from the two

6) Try testing out the groupby functions

df.groupby(column).agg(func), where func is an aggregate function (try sum, max, min...)

Resources can be found here: http://pandas.pydata.org/pandas-docs/stable/10min.html#grouping
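
As a starting point for step 6, here is a minimal groupby sketch. The rows are made up; the column names `Gender` and `Clicks` are borrowed from the dataset used earlier only for familiarity.

```python
import pandas as pd

clicks = pd.DataFrame({'Gender': [0, 1, 0, 1, 1],
                       'Clicks': [0, 2, 1, 0, 3]})   # made-up rows

# One aggregate per group: total clicks for each Gender value
total_clicks = clicks.groupby('Gender')['Clicks'].sum()
print(total_clicks)

# Several aggregates at once: one column per function
summary = clicks.groupby('Gender')['Clicks'].agg(['sum', 'max', 'min'])
print(summary)
```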


## Plotting!

In [None]:
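# pandas.io.data was split out of pandas into the separate pandas-datareader package in later releases
import pandas.io.data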
import datetime
import matplotlib.pyplot as plt

%matplotlib inline

mu, sigma = 0, 0.1
normal_dist = np.random.normal(mu, sigma, 1000)
# Note: the variable is named aapl, but the ticker requested here is 'FB'
aapl = pd.io.data.get_data_yahoo('FB', 
                                 start=datetime.datetime(2015, 4, 1), 
                                 end=datetime.datetime(2015, 4, 28))
aapl.head()

## Matplotlib

Matplotlib is the standard, low-level library for building visualizations. Although tried and true, it can be cumbersome compared to higher-level packages such as Seaborn or Bokeh. Note that many Python visualization packages (Seaborn and pandas plotting, for example) are built on top of matplotlib; Bokeh is not.

In [None]:
fig = plt.figure(figsize=(20,16))

ax = fig.add_subplot(2,2,1)
ax.plot(aapl.index, aapl['Close'])
ax.set_title('Line plots', size=24)

ax = fig.add_subplot(2,2,2)
ax.plot(aapl['Close'], 'o')
ax.set_title('Scatter plots', size=24)

ax = fig.add_subplot(2,2,3)
ax.hist(normal_dist, bins=50)
ax.set_title('Histograms', size=24)
ax.set_ylabel('count', size=16)  # bar heights are counts, so the label belongs on the y axis

ax = fig.add_subplot(2,2,4)
ax.boxplot(normal_dist)
ax.set_title('Boxplots', size=24)

### Bokeh
To install Bokeh, go to a terminal and type:

`conda install bokeh` 

Bokeh is built by the same people that created Anaconda (Continuum Analytics) and is designed out of the box for web display, which makes it well suited to creating presentation-ready, interactive visuals quickly. Labs in this course will be shown in Bokeh. Check out http://bokeh.pydata.org/en/latest/docs/quickstart.html#concepts to see some of the range of capabilities.

In [None]:
from bokeh.plotting import figure, output_notebook,show
output_notebook()

In [None]:
# prepare some data
x = aapl.Low
y = aapl['High']

# create a new plot with a title and axis labels
p = figure(title="Stock High vs. Low", x_axis_label='Low', y_axis_label='High')

# add a circle (scatter) renderer with a legend entry and outline thickness
p.circle(x, y, legend="High vs. Low", line_width=2)

# show the results
show(p)

In [None]:
x = aapl.index
y = aapl.Close
p = figure(title="Stock Open & Close over time", x_axis_label='Date', y_axis_label='Price',x_axis_type="datetime")
# Note that x_axis_type is declared so the dates on the x axis render as dates
p.square(x, y, legend="Close")
p.circle(x,aapl.Open,legend='Open',color='red')
# show the results
show(p)

## Pandas Plotting!

The plot method is a great, quick way to visualize your dataframes. Select the columns you care to view and call .plot() on the result; by default it draws each column as a line against the index.

We will be revisiting this so just take a second to appreciate what can be done with one line of code.

In [None]:
aapl[['Open','Close']].plot()

In [None]:
aapl[['High','Low','Open','Close']].plot(kind='box')