# Linear Algebra

#### Importing Packages
Python has an extensive collection of third-party libraries, or ***packages***, that provide additional functions, data structures, and more. Most packages of interest are hosted on the Python Package Index (***PyPI***) and can be installed into your environment using ***pip***. The Anaconda distribution in the pre-work includes many that are useful in data science, so you should have most of them installed already.

ref:  https://pypi.python.org/pypi

A note on namespaces when importing - there are a few different ways to import:

    1. import numpy

    - everything numpy exports is accessible through the numpy namespace
    - e.g. numpy.array([1,2,3])


    2. import numpy as np

    - same as 1, except that the namespace is bound to the alias 'np'
    - e.g. np.array([1,2,3])


    3. from numpy import *

    - adds every public name from numpy to the global namespace
    - e.g. array([1,2,3])
    - Note: This can be dangerous. If two modules export the same name, whichever is imported last silently overwrites the earlier one (a naming collision).
    
    4. from numpy import array
    
    - imports only the indicated names into the global namespace
    - e.g. array([1,2,3])
    - Note: this is usually fine because you are being explicit
    
    We will generally use 2 and 4 (sparingly)
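
To make the naming-collision risk concrete, here is a minimal sketch. It assumes only the standard `math` module and numpy, and the name `first_sqrt` is purely illustrative. Both modules export a function called `sqrt`, so the second import silently replaces the first.

```python
import math
import numpy as np

from math import sqrt        # sqrt now refers to math.sqrt
first_sqrt = sqrt            # keep a handle on the first binding
from numpy import sqrt       # same name: silently rebinds sqrt to numpy's version

print(first_sqrt is math.sqrt)   # True
print(sqrt is np.sqrt)           # True
print(sqrt is first_sqrt)        # False: the earlier binding was overwritten
```

With `from module import *` the same rebinding happens for every shared name at once, which is why options 2 and 4 are preferred.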

### NumPy

As we've seen in lecture, linear algebra is the branch of mathematics that studies vectors, vector spaces, and the linear transformations between them. These ideas matter because many data science techniques are built on them (e.g. Linear Regression, Support Vector Machines, Recommender Systems, Neural Networks).

NumPy is a package designed to be used in scientific computing, and specifically around building and manipulating N-dimensional array objects.

In [2]:
import numpy

import numpy as np

from numpy import absolute

# The next one is dangerous to do, and not recommended 
# except in cases where you know why you're using it
# from numpy import *

Now we can do the same thing three ways

In [3]:
print numpy.absolute(-10)
print np.absolute(-10)
print absolute(-10)

10
10
10


We can create the Linear Algebra Objects we saw in lecture

In [4]:
vector = np.array([1, 2, 1])

In [10]:
data = np.array([[1, 2, 3],[2, 4, 9]])
data

array([[1, 2, 3],
       [2, 4, 9]])

In [6]:
data[0]  # first row

array([1, 2, 3])

In [8]:
data[:, 1]  # all rows, second column

array([2, 4])

Numpy lets us perform matrix operations

In [9]:
np.dot(data, vector)

array([ 8, 19])
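
As a sanity check on what `np.dot` computes for a matrix and a vector, the sketch below rebuilds the same result by hand: each output entry is the sum of one row multiplied element-wise with the vector. The name `by_hand` is illustrative only.

```python
import numpy as np

data = np.array([[1, 2, 3], [2, 4, 9]])
vector = np.array([1, 2, 1])

# One entry per row: multiply element-wise with the vector, then sum
by_hand = np.array([sum(r * v for r, v in zip(row, vector)) for row in data])

print(by_hand)                                       # [ 8 19]
print(np.array_equal(by_hand, np.dot(data, vector))) # True
```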

We can transpose an array

In [9]:
print data
print data.T

[[1 2 3]
 [2 4 9]]
[[1 2]
 [2 4]
 [3 9]]


#### Creating a square matrix array

In [14]:
a = np.arange(25).reshape(5,5)
# arange(n) is a function that creates a 1-D array of the integers 0 through n-1
# reshape(M, N) is a method that returns the same values arranged as an M x N array
a

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

In [15]:
biga = a*10
biga

array([[  0,  10,  20,  30,  40],
       [ 50,  60,  70,  80,  90],
       [100, 110, 120, 130, 140],
       [150, 160, 170, 180, 190],
       [200, 210, 220, 230, 240]])

In [16]:
print biga.mean()
print biga.mean(0)  # axis 0: average of each column
biga.mean(1)  # axis 1: average of each row
# type(biga.mean(1))

120.0
[ 100.  110.  120.  130.  140.]


array([  20.,   70.,  120.,  170.,  220.])
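
The axis argument is easy to mix up, so here is a small sketch on a 2x3 array (the values are made up for illustration). The axis you pass is the one that gets collapsed: `axis=0` averages down the rows and leaves one value per column, while `axis=1` averages across the columns and leaves one value per row.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

col_means = m.mean(axis=0)   # collapses the 2 rows -> 3 values, one per column
row_means = m.mean(axis=1)   # collapses the 3 columns -> 2 values, one per row

print(col_means)   # [2.5 3.5 4.5]
print(row_means)   # [2. 5.]
```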

Creating a matrix with numpy

In [17]:
bigm = np.matrix(biga-20)
bigm

matrix([[-20, -10,   0,  10,  20],
        [ 30,  40,  50,  60,  70],
        [ 80,  90, 100, 110, 120],
        [130, 140, 150, 160, 170],
        [180, 190, 200, 210, 220]])

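Creating the Inverse of a Matrix. Note that this particular matrix is singular (its rows are linearly dependent), so it has no true inverse; the enormous values in the output below are floating-point artifacts rather than a meaningful result.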

In [18]:
np.linalg.inv(biga-20)

array([[ -5.82741163e+12,  -3.82630046e+13,   6.47978853e+13,
          8.50288992e+12,  -2.92103589e+13],
       [  1.03354093e+13,   4.66192930e+13,  -6.97823380e+13,
         -4.16348403e+13,   5.44624760e+13],
       [ -6.86095256e+13,   1.33700614e+14,  -7.97512434e+13,
          3.28387473e+13,  -1.81785922e+13],
       [  1.29522470e+14,  -2.54207088e+14,   1.09657960e+14,
          2.52154667e+13,  -1.01888078e+13],
       [ -6.54209419e+13,   1.12150186e+14,  -2.49222636e+13,
         -2.49222636e+13,   3.11528295e+12]])
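
Before trusting an inverse, it helps to check that the matrix is actually invertible. The sketch below is one way to do that, using `np.linalg.matrix_rank`; the 2x2 matrix `well_behaved` is a made-up example chosen because it is invertible.

```python
import numpy as np

# The same 5x5 matrix as above: every row is the previous row plus a constant,
# so only two rows are linearly independent
singular = np.arange(25).reshape(5, 5) * 10 - 20
rank = np.linalg.matrix_rank(singular)
print(rank)   # 2, less than 5, so no true inverse exists

# A hypothetical invertible matrix for comparison
well_behaved = np.array([[2.0, 1.0],
                         [1.0, 3.0]])
inv = np.linalg.inv(well_behaved)

# A valid inverse multiplied by the original gives the identity matrix
is_identity = np.allclose(np.dot(well_behaved, inv), np.eye(2))
print(is_identity)   # True
```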

#### Slices

In [19]:
bigm = np.array(bigm)
print bigm
bigm[0]

[[-20 -10   0  10  20]
 [ 30  40  50  60  70]
 [ 80  90 100 110 120]
 [130 140 150 160 170]
 [180 190 200 210 220]]


array([-20, -10,   0,  10,  20])

In [21]:
#Same thing, but demonstrating the full slice with a colon
print biga
#biga[2,:]
biga[:,2]

[[  0  10  20  30  40]
 [ 50  60  70  80  90]
 [100 110 120 130 140]
 [150 160 170 180 190]
 [200 210 220 230 240]]


array([ 20,  70, 120, 170, 220])
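
Slices can also select a rectangular block by giving a range on both axes. One detail worth knowing is that a basic slice is a view onto the original array, not a copy, so writing to the slice changes the original. A minimal sketch (the names `block` and `independent` are illustrative):

```python
import numpy as np

a = np.arange(25).reshape(5, 5)

block = a[1:3, 2:4]      # rows 1-2, columns 2-3
print(block)             # [[ 7  8]
                         #  [12 13]]

block[0, 0] = -1         # writes through to a, because block is a view
print(a[1, 2])           # -1

independent = a[1:3, 2:4].copy()   # use .copy() when a separate array is needed
independent[0, 0] = 99
print(a[1, 2])           # still -1
```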

#### Describing your Arrays

In [22]:
compa = np.arange(30).reshape(5,3,2)
compa

array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]],

       [[12, 13],
        [14, 15],
        [16, 17]],

       [[18, 19],
        [20, 21],
        [22, 23]],

       [[24, 25],
        [26, 27],
        [28, 29]]])

In [23]:
# let's describe it
print compa.shape
print compa.ndim
print compa.dtype

(5L, 3L, 2L)
3
int32


In [24]:
compa[3,:,1]

array([19, 21, 23])

In [25]:
# We can assign values using list-like indexing
# But be careful with types: the array holds integers, so 5.9 is truncated to 5
compa[0,0,0] = 5.9
compa[0,0,0]

5

We can change the datatype when needed

In [26]:
compa = compa.astype(float)
compa[0,0,0] = 5.75
compa[0,0,0]
compa

array([[[  5.75,   1.  ],
        [  2.  ,   3.  ],
        [  4.  ,   5.  ]],

       [[  6.  ,   7.  ],
        [  8.  ,   9.  ],
        [ 10.  ,  11.  ]],

       [[ 12.  ,  13.  ],
        [ 14.  ,  15.  ],
        [ 16.  ,  17.  ]],

       [[ 18.  ,  19.  ],
        [ 20.  ,  21.  ],
        [ 22.  ,  23.  ]],

       [[ 24.  ,  25.  ],
        [ 26.  ,  27.  ],
        [ 28.  ,  29.  ]]])

#### Stacking arrays

Arrays can only be stacked when their sizes match along every dimension other than the one being stacked.

In [36]:
a = np.array((1,2,3)) 
#((1,2,3),(2,7,8))
b = np.array((2,3,4))
print a
print b
print 'H Stack'
print np.hstack((a,b))
print 'V Stack'
print np.vstack((a,b))

[1 2 3]
[2 3 4]
H Stack
[1 2 3 2 3 4]
V Stack
[[1 2 3]
 [2 3 4]]


In [41]:
a = np.array([[1],[2],[3]])
b = np.array([[2],[3],[4]])
print 'H Stack'
print np.hstack((a,b))
print 'V Stack'
print np.vstack((a,b))

H Stack
[[1 2]
 [2 3]
 [3 4]]
V Stack
[[1]
 [2]
 [3]
 [2]
 [3]
 [4]]
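
To see the size requirement in action, the sketch below stacks a (2, 3) array with a (1, 3) array; the shapes are made up for illustration. Vertical stacking works because the column counts match, while horizontal stacking fails because the row counts differ.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])       # shape (2, 3)
b = np.array([[7, 8, 9]])       # shape (1, 3)

stacked = np.vstack((a, b))     # fine: both have 3 columns
print(stacked.shape)            # (3, 3)

try:
    np.hstack((a, b))           # fails: 2 rows vs. 1 row
    hstack_ok = True
except ValueError:
    hstack_ok = False
print(hstack_ok)                # False
```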


### Using Random Numbers

Random numbers are very helpful and are necessary at times for testing data pipelines and running statistical analyses. Functions for creating random values are under numpy.random.

In [42]:
#Create a randomized array
rm = np.random.rand(5,5)
rm

array([[  7.13319258e-01,   8.42189069e-01,   1.63225316e-01,
          4.80053118e-01,   3.84307985e-01],
       [  2.13512146e-01,   1.66004596e-01,   4.21936434e-01,
          5.50544838e-01,   5.67452578e-01],
       [  7.19079230e-01,   4.22443409e-01,   4.25580226e-01,
          9.19818585e-01,   2.62239719e-01],
       [  5.22630649e-01,   4.74415878e-01,   2.70364993e-01,
          2.82488948e-01,   6.07233267e-04],
       [  9.09001822e-01,   9.85897173e-01,   8.47608177e-01,
          2.98684082e-01,   8.75871756e-02]])

In [43]:
rm.shape

(5L, 5L)
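
Because the values are random, they will differ every time the cell runs. When testing a pipeline it is often useful to make them repeatable, and seeding the generator is one way to do that. A minimal sketch; the seed value 42 is arbitrary:

```python
import numpy as np

np.random.seed(42)            # fix the generator's starting state
first = np.random.rand(3)

np.random.seed(42)            # reset to the same state
second = np.random.rand(3)

print(np.array_equal(first, second))   # True: same seed, same numbers
```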

You can shuffle values randomly as well

In [37]:
# shuffle reorders the array in place along its first axis only:
# whole rows change position, but the values within each row stay together
np.random.shuffle(rm)
rm

array([[ 0.6199001 ,  0.40078572,  0.02607153,  0.59273285,  0.27642771],
       [ 0.42070097,  0.26947697,  0.08005154,  0.38194918,  0.91198734],
       [ 0.4492709 ,  0.20417048,  0.74580244,  0.52165806,  0.19525212],
       [ 0.00361691,  0.18788956,  0.42722888,  0.19975897,  0.61429637],
       [ 0.84994489,  0.61413401,  0.37165229,  0.12813628,  0.80635363]])
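
The sketch below checks that behaviour on a small integer array, where it is easier to see: after shuffling, the same set of rows is present, only their order may differ. The array and the seed are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(0)
m = np.arange(12).reshape(4, 3)

rows_before = set(tuple(row) for row in m.tolist())
np.random.shuffle(m)                     # in place, along the first axis
rows_after = set(tuple(row) for row in m.tolist())

print(rows_before == rows_after)         # True: rows moved, contents intact
```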

In [38]:
print rm.mean()
print rm.mean(0) #Average per column
print rm.mean(1) #average per row

0.411969988774
[ 0.46868676  0.33529135  0.33016134  0.36484707  0.56086343]
[ 0.38318358  0.4128332   0.4232308   0.28655814  0.55404422]


In [44]:
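# np.random.rand draws from a uniform distribution on [0, 1); for a normal distribution
# with a given mean and standard deviation, use np.random.normal(mean, stdev, shape)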
rm = np.random.normal(5,9,(30,30))
rm

array([[  1.35801903e+01,   1.73449198e+01,   1.65381083e+01,
          7.66431140e+00,  -3.01714189e+00,   1.65777508e+01,
          9.81036425e+00,   1.58782163e+01,   2.13579907e+01,
          1.18148704e+01,  -1.86678596e+00,  -5.48236247e+00,
          4.60506525e+00,   6.04445758e+00,   6.94409811e+00,
          1.04781371e+01,  -9.22374829e+00,   8.09120048e+00,
          5.22432821e+00,   1.47308180e+01,  -4.41229431e+00,
          6.67508220e+00,  -1.09807227e+00,   1.62048868e+01,
         -4.94282449e-01,   6.65654897e+00,   1.48148553e+01,
          8.75876664e+00,  -6.58185089e+00,   6.03886988e+00],
       [  5.08628506e+00,   2.26366477e+01,   7.26021901e+00,
          1.76297778e+01,  -6.96405720e+00,   1.10944532e+01,
         -7.58976838e+00,  -6.66196629e+00,   9.16048346e+00,
          1.78545132e+01,   2.47991278e+01,  -1.38377915e+00,
          1.47224797e+01,  -7.14958743e+00,   1.00324069e+01,
          1.18760894e+01,   1.78481935e+00,  -1.02950117e+00,
       

In [45]:
print rm.mean(), "should be close to the input mean (5)"
print rm.var(), "should be close to the input stdev squared (9**2 = 81)"
print np.median(rm)

5.45165392009 should be close to the input mean (5)
85.3815749941 should be close to the input stdev squared (9**2 = 81)
5.34887644929
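
A common slip with `np.random.normal` is passing a variance where the standard deviation is expected. The sketch below draws a large sample (the size and seed are arbitrary) and confirms that the second argument behaves as a standard deviation, so the sample variance lands near its square.

```python
import numpy as np

np.random.seed(1)
sample = np.random.normal(5, 9, 100000)   # mean 5, standard deviation 9

print(sample.mean())   # close to 5
print(sample.std())    # close to 9
print(sample.var())    # close to 81, i.e. 9 squared
```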


Find more distributions and random functions here: http://docs.scipy.org/doc/numpy/reference/routines.random.html

### Exercise 1

1) Create a 4x5 array of integers between 0 and 19

In [43]:
abc = np.arange(20).reshape(4,5)
abc

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

2) Create a 50x500 array with a mean of 20 and variance of 100. Save it to a variable called  biggie

In [58]:
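# np.random.normal takes the standard deviation, so a variance of 100 means a stdev of 10
biggie = np.random.normal(20, 10, (50, 500))

In [59]:
biggie.var()

100.31433595961283

In [60]:
biggie = biggie - 20  # shift the mean from 20 to 0
biggie = biggie / 2   # halve the spread: the variance drops from about 100 to about 25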

## Pandas (python package)

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.

Pandas is great for tabular/indexed data

In [47]:
# NOTE: you should normally put all your imports at the top of the file
import pandas as pd

In [48]:
data = pd.read_csv('../data/nytimes.csv')

In [49]:
# Note here we're calling the head method on the dataframe to return the 'head' of the 
# dataframe, in this case the first 4 lines
# head() actually creates a new copy of the data, this is important later in the course!
data.head(4)

   Age  Gender  Impressions  Clicks  Signed_In
0   36       0            3       0          1
1   73       1            3       0          1
2   30       0            3       0          1
3   49       1            3       0          1


In [53]:
data[0:5]

   Age  Gender  Impressions  Clicks  Signed_In
0   36       0            3       0          1
1   73       1            3       0          1
2   30       0            3       0          1
3   49       1            3       0          1
4   47       1           11       0          1


In [66]:
# Each DataFrame has an index
# Sometimes you will need to reindex
data.index

Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9, 
            ...
            458431, 458432, 458433, 458434, 458435, 458436, 458437, 458438,
            458439, 458440],
           dtype='int64', length=458441)

In [67]:
# This is a Series
# A DataFrame is made up of several Series that share the same index
data.Age

0         36
1         73
2         30
3         49
4         47
5         47
6          0
7         46
8         16
9         52
10         0
11        21
12         0
13        57
14        31
15         0
16        40
17        31
18        38
19         0
20        59
21        61
22        48
23        29
24         0
25        19
26        19
27        48
28        48
29        21
          ..
458411    55
458412    68
458413     0
458414    21
458415    35
458416    26
458417    41
458418    58
458419    46
458420    45
458421    46
458422    47
458423    22
458424    21
458425    40
458426    49
458427    43
458428    40
458429    49
458430     0
458431    21
458432    30
458433    21
458434    61
458435    51
458436     0
458437     0
458438    72
458439     0
458440     0
Name: Age, dtype: int64
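
To illustrate the relationship between a Series and a DataFrame, the sketch below builds a small frame out of two named Series and then pulls one back out. The values are made up and only mimic the column names of the dataset above.

```python
import pandas as pd

age = pd.Series([36, 73, 30], name='Age')
clicks = pd.Series([0, 0, 1], name='Clicks')

# Placing the Series side by side (axis=1) aligns them on their shared index
frame = pd.concat([age, clicks], axis=1)
print(frame)

# Selecting a single column gives back a Series
column = frame['Age']
print(type(column))
```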

In [68]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 458441 entries, 0 to 458440
Data columns (total 5 columns):
Age            458441 non-null int64
Gender         458441 non-null int64
Impressions    458441 non-null int64
Clicks         458441 non-null int64
Signed_In      458441 non-null int64
dtypes: int64(5)
memory usage: 21.0 MB


In [69]:
data.describe() # only works on numbers

             Age    Gender  Impressions    Clicks  Signed_In
count   458441.0  458441.0     458441.0  458441.0   458441.0
mean   29.482551  0.367037     5.007316  0.092594    0.70093
std    23.607034  0.481997     2.239349  0.309973   0.457851
min          0.0       0.0          0.0       0.0        0.0
25%          0.0       0.0          3.0       0.0        0.0
50%         31.0       0.0          5.0       0.0        1.0
75%         48.0       1.0          6.0       0.0        1.0
max        108.0       1.0         20.0       4.0        1.0


In [54]:
# We can change this data into numpy
type(data.Age.values)

numpy.ndarray

#### Just like in numpy, we can use mean, var, and other functions on the data

In [71]:
print data.Age.mean()
print data.Age.var()
print data.Age.max()
print data.Age.min()

29.4825506445
557.292044027
108
0


In [72]:
# Function that groups users by age.
def map_age_category(x):
    if x < 18:
        return '1'
    elif x < 25:
        return '2'
    elif x < 32:
        return '3'
    elif x < 45:
        return '4'
    else:
        return '5'

data['age_categories'] = data['Age'].apply(map_age_category)

In [73]:
data.head()

   Age  Gender  Impressions  Clicks  Signed_In  age_categories
0   36       0            3       0          1               4
1   73       1            3       0          1               5
2   30       0            3       0          1               3
3   49       1            3       0          1               5
4   47       1           11       0          1               5
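
For binning like this, pandas also offers `pd.cut`, which may be worth knowing as an alternative to writing the if/elif chain by hand. The sketch below is not part of the original notebook: it reuses the same cut-offs (18, 25, 32, 45) on a handful of made-up ages, and `right=False` makes each bin include its lower edge and exclude its upper edge, matching the `<` comparisons in `map_age_category`.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0, 17, 18, 24, 25, 31, 32, 44, 45, 108])   # sample values only

bins = [-np.inf, 18, 25, 32, 45, np.inf]    # same boundaries as map_age_category
labels = ['1', '2', '3', '4', '5']

# right=False -> intervals are [low, high), e.g. 18 falls in category '2'
categories = pd.cut(ages, bins=bins, labels=labels, right=False)
print(list(categories))
```

Note that the result is a categorical column rather than plain strings.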


#### Sorting data

In [57]:
data.sort_index(axis=0, ascending=False).head()

        Age  Gender  Impressions  Clicks  Signed_In
458440    0       0            3       0          0
458439    0       0            5       0          0
458438   72       1            5       0          1
458437    0       0            4       0          0
458436    0       0            2       0          0


In [69]:
# sort_values replaces the older DataFrame.sort, which newer pandas versions no longer provide
data.sort_values(['Signed_In','Impressions']).head()

      Age  Gender  Impressions  Clicks  Signed_In
300     0       0            0       0          0
1335    0       0            0       0          0
2799    0       0            0       0          0
2887    0       0            0       0          0
3105    0       0            0       0          0
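
When sorting by several columns, each column can have its own direction. The sketch below uses a tiny made-up frame with the same column names and sorts `Signed_In` ascending while sorting `Impressions` descending within each group.

```python
import pandas as pd

df_small = pd.DataFrame({'Signed_In':   [1, 0, 1, 0],
                         'Impressions': [3, 5, 1, 2]})

# One ascending flag per sort column, in the same order
ordered = df_small.sort_values(['Signed_In', 'Impressions'],
                               ascending=[True, False])
print(ordered)
print(list(ordered.index))   # [1, 3, 0, 2]
```

The original row labels travel with the rows, which is why the index comes out as 1, 3, 0, 2.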


In [81]:
ran_data = [
    ['a', 1]
    , ['b', 2]
    , ['c', 3]
]
df = pd.DataFrame(ran_data, columns=['col_a', 'numeric'])
df

  col_a  numeric
0     a        1
1     b        2
2     c        3


#### Indexing functions

Pandas Dataframes support various methods for indexing:

- .iloc
- .loc
- .ix

In [72]:
df

  col_a  numeric
0     a        1
1     b        2
2     c        3


In [81]:
# iloc accesses a row by its row number
df.iloc[0]

col_a      a
numeric    1
Name: 0, dtype: object

In [82]:
df.set_index('col_a', inplace = True)
df

       numeric
col_a
a            1
b            2
c            3


In [83]:
# loc accesses a dataframe row by its index label (or column label)
df.loc['a'] = 5
df.loc['b'] = 3

In [84]:
df

       numeric
col_a
a            5
b            3
c            3


In [85]:
# This can be used to add new columns
df.loc[:,'C'] = df.loc[:,'numeric']

df

       numeric  C
col_a
a            5  5
b            3  3
c            3  3


In [83]:
# is equivalent to:
df['D'] = df.loc[:,'numeric']

df

  col_a  numeric  D
0     a        1  1
1     b        2  2
2     c        3  3


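.ix is a generic indexer that accepts either labels or integer positions. It is deprecated in newer versions of pandas, where .loc and .iloc are preferred.

Values can be set by index and by index + column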

In [84]:
df.ix[2] = 5
df.ix[2, 'numeric'] = 10

df

  col_a  numeric  D
0     a        1  1
1     b        2  2
2     5       10  5
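
The difference between label-based and position-based access is easiest to see when the index is not just 0, 1, 2. The sketch below uses a small made-up frame (called `letters` here to avoid clobbering `df`) indexed by letters, reads the same cell both ways, and then adds a new row by assigning to a label that does not exist yet.

```python
import pandas as pd

letters = pd.DataFrame({'numeric': [1, 2, 3]}, index=['a', 'b', 'c'])

by_label = letters.loc['b', 'numeric']    # row labelled 'b', column 'numeric'
by_position = letters.iloc[1, 0]          # second row, first column
print(by_label == by_position)            # True: same cell

letters.loc['d'] = 4                      # assigning to a new label adds a row
print(list(letters.index))                # ['a', 'b', 'c', 'd']
```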


### Combining DataFrames
#### Appending
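We can append dataframes together. Note that `DataFrame.append` is deprecated in newer versions of pandas; `pd.concat`, covered later in this section, does the same job.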

In [89]:
df_combine = df.append(df)
df_combine

       numeric  C  D
col_a
a            5  5  5
b            3  3  3
c            2  8  2
a            5  5  5
b            3  3  3
c            2  8  2


In [90]:
# When DataFrames are appended together, we often need to create a new index
df_combine.reset_index()

  col_a  numeric  C  D
0     a        5  5  5
1     b        3  3  3
2     c        2  8  2
3     a        5  5  5
4     b        3  3  3
5     c        2  8  2
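
Two details of `reset_index` are easy to miss: it returns a new DataFrame rather than changing the original, and by default the old index is kept as a regular column. The sketch below shows both on a small made-up frame, along with `drop=True` for discarding the old index.

```python
import pandas as pd

base = pd.DataFrame({'numeric': [5, 3, 2]},
                    index=pd.Index(['a', 'b', 'c'], name='col_a'))
combined = pd.concat([base, base])            # index repeats: a, b, c, a, b, c

kept = combined.reset_index()                 # old index becomes the 'col_a' column
dropped = combined.reset_index(drop=True)     # old index is thrown away

print(list(kept.columns))       # ['col_a', 'numeric']
print(list(dropped.columns))    # ['numeric']
print(list(combined.index))     # unchanged: ['a', 'b', 'c', 'a', 'b', 'c']
```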


#### Join lets us join together dataframes using their index

In [80]:
df_2 = pd.DataFrame([1, 2, 3], columns=['col'])
df_2

   col
0    1
1    2
2    3


In [88]:
# join defaults to a left join on the index, so NaN is filled in
# wherever the right-hand frame has no matching index value
df.join(df_2)


  col_a  numeric  D  col
0     a        1  1    1
1     b        2  2    2
2     5       10  5    3


In [86]:
# reset the index
df_1 = df.reset_index()
df_1

   index  col_a  numeric  D
0      0      a        1  1
1      1      b        2  2
2      2      5       10  5


In [87]:
# try joining again:
df_1.join(df_2)


   index  col_a  numeric  D  col
0      0      a        1  1    1
1      1      b        2  2    2
2      2      5       10  5    3
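
In the joins above every index value has a partner, so no gaps appear. The sketch below uses two made-up frames where the right-hand one is missing index 1, which shows how a left join keeps every left row and fills the unmatched cell with NaN.

```python
import pandas as pd

left = pd.DataFrame({'numeric': [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({'col': [10, 30]}, index=[0, 2])     # no row for index 1

joined = left.join(right)     # left join on the index by default
print(joined)

print(pd.isnull(joined.loc[1, 'col']))   # True: nothing to match at index 1
```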


#### Merge allows us to join on any fields

In [90]:
# merge defaults to an inner join, which omits rows that have no match.
# Here how='outer' is passed instead, so unmatched rows from both sides are kept and padded with NaN
print df
print df_2
df.merge(df_2, left_on='numeric', right_on='col',how='outer')

  col_a  numeric  D
0     a        1  1
1     b        2  2
2     5       10  5
   col
0    1
1    2
2    3


  col_a  numeric    D  col
0     a      1.0  1.0  1.0
1     b      2.0  2.0  2.0
2     5     10.0  5.0  NaN
3   NaN      NaN  NaN  3.0
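
To compare the two join types side by side, the sketch below merges two small made-up frames once with the default (inner) and once with `how='outer'`. Only the keys 1 and 2 appear on both sides, so the inner result has two rows and the outer result has four.

```python
import pandas as pd

left = pd.DataFrame({'col_a': ['a', 'b', 'c'], 'numeric': [1, 2, 10]})
right = pd.DataFrame({'col': [1, 2, 3]})

inner = left.merge(right, left_on='numeric', right_on='col')               # default: inner
outer = left.merge(right, left_on='numeric', right_on='col', how='outer')

print(len(inner))   # 2: only keys 1 and 2 match
print(len(outer))   # 4: keys 10 and 3 are kept too, padded with NaN
```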


#### Concat combines a list of DataFrames together

In [91]:
# It can be used like append
pd.concat([df, df])

  col_a  numeric  D
0     a        1  1
1     b        2  2
2     5       10  5
0     a        1  1
1     b        2  2
2     5       10  5


In [92]:
# But concat will create a sparse DataFrame when columns don't match
# This can create huge DataFrames when mismatches occur
pd.concat([df, df_2])

     D  col  col_a  numeric
0  1.0  NaN      a      1.0
1  2.0  NaN      b      2.0
2  5.0  NaN      5     10.0
0  NaN  1.0    NaN      NaN
1  NaN  2.0    NaN      NaN
2  NaN  3.0    NaN      NaN
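
If the padded columns are unwanted, `pd.concat` accepts `join='inner'` to keep only the columns that every frame shares. A minimal sketch with two made-up frames that have just the column `x` in common:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
b = pd.DataFrame({'x': [5, 6], 'z': [7, 8]})

padded = pd.concat([a, b])                  # default: union of columns, gaps become NaN
shared = pd.concat([a, b], join='inner')    # only the columns present in both

print(sorted(padded.columns))       # ['x', 'y', 'z']
print(list(shared.columns))         # ['x']
print(padded.isnull().sum().sum())  # 4 missing cells
```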


## Exercise 2
### Combining numpy and pandas

1) Create 2 arrays of integers

One should be created using np.random

In [93]:
arr1 = np.arange(10).reshape(5, 2)         # integers 0-9 arranged as 5 rows x 2 columns
arr2 = np.random.randint(0, 10, (5, 2))    # random integers from 0 to 9, same shape

2) Turn those arrays into pandas DataFrames

The columns can be named numerically

In [99]:
df1 = pd.DataFrame(arr1, columns=['col', 'col2'])
df2 = pd.DataFrame(arr2, columns=['col3', 'col4'])

3) Use some of the summary functions on the dataframes and arrays

Show how mean gives the same result in pandas and numpy, and check whether var does too (pandas divides by n-1 by default, numpy divides by n)
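
One thing to watch for in this step is that the two libraries use different defaults for variance. The sketch below, on four made-up values, shows that the means agree, the default variances do not, and passing `ddof=0` to pandas brings them back in line.

```python
import numpy as np
import pandas as pd

values = np.array([1.0, 2.0, 3.0, 4.0])
series = pd.Series(values)

print(values.mean(), series.mean())   # 2.5 2.5
print(values.var())                   # 1.25 (divides by n)
print(series.var())                   # about 1.667 (divides by n - 1)
print(series.var(ddof=0))             # 1.25, matches numpy
```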


4) Add an extra row (a new index label) using .loc

5) Using merge or join, create a single DataFrame from the two

6) Try testing out the groupby functions

`df.groupby(column).agg(func)` (func can be any aggregate function; try sum, max, min...)

Resources can be found here: http://pandas.pydata.org/pandas-docs/stable/10min.html#grouping
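
As a starting point for this step, here is a minimal sketch of the groupby pattern on a tiny made-up frame. It groups the rows by `Gender`, then aggregates `Clicks` first with a single function and then with several at once.

```python
import pandas as pd

clicks_df = pd.DataFrame({'Gender': [0, 1, 0, 1],
                          'Clicks': [1, 0, 2, 3]})

totals = clicks_df.groupby('Gender')['Clicks'].sum()
print(totals)

# agg accepts a list to compute several aggregates per group
summary = clicks_df.groupby('Gender')['Clicks'].agg(['sum', 'max', 'min'])
print(summary)
```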


## Plotting!

In [None]:
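# NOTE: pandas.io.data only exists in older pandas releases; it was later moved
# into the separate pandas-datareader package
import pandas.io.data
import datetime
import matplotlib.pyplot as plt

%matplotlib inline

mu, sigma = 0, 0.1
normal_dist = np.random.normal(mu, sigma, 1000)  # 1000 draws, used for the histogram and boxplot
# NOTE: the variable is called aapl, but the ticker requested here is 'FB'
aapl = pd.io.data.get_data_yahoo('FB', 
                                 start=datetime.datetime(2015, 4, 1), 
                                 end=datetime.datetime(2015, 4, 28))
aapl.head()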
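
The download above depends on a remote service, so it may not always be available. As a fallback for following the plotting examples, the sketch below builds a synthetic stand-in with the same date range and the Open, Close, High and Low columns. Everything in it is made up: the name `prices`, the starting level of 80, and the random walk are assumptions for illustration, not real market data.

```python
import numpy as np
import pandas as pd

np.random.seed(0)

# Business days covering the same window as the download
dates = pd.date_range('2015-04-01', '2015-04-28', freq='B')

# A simple random walk for the closing price, with the open scattered around it
close = 80 + np.cumsum(np.random.normal(0, 1, len(dates)))
open_price = close + np.random.normal(0, 0.5, len(dates))

prices = pd.DataFrame({'Open': open_price, 'Close': close}, index=dates)
prices['High'] = prices[['Open', 'Close']].max(axis=1) + 0.5   # always above both
prices['Low'] = prices[['Open', 'Close']].min(axis=1) - 0.5    # always below both

print(prices.head())
```

Assigning `aapl = prices` would let the later plotting cells run against this stand-in.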

## MatPlotLib

MatPlotLib is a standard, granular method for building visualizations. Although tried and true, it can be cumbersome compared to other higher level packages such as Seaborn or Bokeh. Note most visualization packages use matplotlib as their base.

In [None]:
fig = plt.figure(figsize=(20,16))

ax = fig.add_subplot(2,2,1)
ax.plot(aapl.index, aapl['Close'])
ax.set_title('Line plots', size=24)

ax = fig.add_subplot(2,2,2)
ax.plot(aapl['Close'], 'o')
ax.set_title('Scatter plots', size=24)

ax = fig.add_subplot(2,2,3)
ax.hist(normal_dist, bins=50)
ax.set_title('Histograms', size=24)
ax.set_ylabel('count', size=16)

ax = fig.add_subplot(2,2,4)
ax.boxplot(normal_dist)
ax.set_title('Boxplots', size=24)

### Bokeh
To install Bokeh, go to a terminal and type:

`conda install bokeh` 

Bokeh is built by the same people who created Anaconda (Continuum Analytics) and is designed out of the box for web display, which makes it well suited to creating presentation-ready, interactive visuals quickly. Labs in this course will be shown in Bokeh. Check out http://bokeh.pydata.org/en/latest/docs/quickstart.html#concepts to see some of the range of capabilities.

In [None]:
from bokeh.plotting import figure, output_notebook,show
output_notebook()

In [None]:
# prepare some data
x = aapl.Low
y = aapl['High']

# create a new plot with a title and axis labels
p = figure(title="Stock High vs. Low", x_axis_label='Low', y_axis_label='High')

# add a circle (scatter) renderer with a legend entry and outline thickness
p.circle(x, y, legend="High vs. Low", line_width=2)

# show the results
show(p)

In [None]:
x = aapl.index
y = aapl.Close
p = figure(title="Stock Open & Close over time", x_axis_label='Date', y_axis_label='Price',x_axis_type="datetime")
# Note that I've declared the x_axis_type
p.square(x, y, legend="Close")
p.circle(x,aapl.Open,legend='Open',color='red')
# show the results
show(p)

## Pandas Plotting!

The plot method is a great, quick way to visualize your dataframes. By selecting the columns you care to view, calling .plot() on the dataframe defaults to a line chart vs. the index.

We will be revisiting this so just take a second to appreciate what can be done with one line of code.

In [None]:
aapl[['Open','Close']].plot()

In [None]:
aapl[['High','Low','Open','Close']].plot(kind='box')