## Vectorizing Operations

The first element on our checklist was to examine data types.

If we have downcast all our data, what should we do next?

### What is vectorization?
Vectorization means applying an operation to whole arrays of values at once instead of looping over individual values.

Example of a numpy function that is "vectorized":

In [22]:
import numpy as np

In [23]:
%timeit np.sum(np.arange(100000))

148 µs ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Compare this to a function that is not vectorized:

In [24]:
def sum_test():    
    total = 0
    for i in np.arange(100000):
        total += i
    return total

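%timeit sum_test()

Note the parentheses: `%timeit sum_test` without them times only the lookup of the function name, not the loop, and reports something like this:

62.7 ns ± 8.64 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

When the function is actually called, the Python loop is typically orders of magnitude slower than `np.sum`.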

For loops often feel like the natural choice because they are easier to conceive and write. Taking the time to vectorize an operation, however, can speed up code significantly.
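The comparison above can also be reproduced outside IPython. The following is a minimal sketch using the standard library `timeit` module; the helper names `loop_sum` and `vector_sum` are hypothetical, and the explicit `int64` dtype is an assumption made to avoid integer overflow on platforms where the default integer is 32 bits. Timings will vary by machine, so none are shown.

```python
import timeit

import numpy as np

# Explicit 64-bit integers so the sum cannot overflow on any platform.
values = np.arange(100_000, dtype=np.int64)


def loop_sum(arr):
    """Sum element by element in a Python loop (not vectorized)."""
    total = 0
    for x in arr:
        total += x
    return total


def vector_sum(arr):
    """Sum with a single vectorized numpy call."""
    return arr.sum()


# Both approaches must agree before their speed is compared.
assert loop_sum(values) == vector_sum(values)

# Average seconds per call over a handful of repetitions.
repeats = 5
loop_time = timeit.timeit(lambda: loop_sum(values), number=repeats) / repeats
vec_time = timeit.timeit(lambda: vector_sum(values), number=repeats) / repeats
print(f"loop: {loop_time:.6f} s per call, vectorized: {vec_time:.6f} s per call")
```

Passing a lambda that calls the function makes it harder to accidentally time the bare function name.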

Another example of an operation we already vectorized:

To look at crop rotation, we had to assign values to an array based on the values of other arrays.

The non-vectorized version:

In [25]:
y16 = np.random.randint(0,2,size=1000000).reshape(1000,1000)
y17 = np.random.randint(0,2,size=1000000).reshape(1000,1000)

In [26]:
def non_vect():
    rotate = np.zeros([1000,1000])
    for i in range(1000):
        for j in range(1000):
            rotate[i,j] = (1-y16[i,j])*y17[i,j]
    return rotate

%timeit non_vect()

1.31 s ± 39.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


The vectorized version:

In [29]:
(1-y16)*y17

array([[0, 0, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 1, 0],
       [1, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 1, 0],
       [1, 1, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [27]:
%timeit (1-y16)*y17

4.29 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


That is roughly a 300-fold speedup (1.31 s versus 4.29 ms).
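Before trusting a vectorized rewrite, it is worth checking that it returns the same values as the loop. Here is a minimal sketch on smaller arrays; the 50 by 50 size, the seeded generator, and the boolean formulation `as_bool` are assumptions added for illustration, not part of the original analysis.

```python
import numpy as np

# Small 0/1 arrays so the loop version runs quickly.
rng = np.random.default_rng(0)
y16 = rng.integers(0, 2, size=(50, 50))
y17 = rng.integers(0, 2, size=(50, 50))


def non_vect(a, b):
    """Element-by-element version of (1 - a) * b."""
    out = np.zeros(a.shape, dtype=a.dtype)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i, j] = (1 - a[i, j]) * b[i, j]
    return out


looped = non_vect(y16, y17)
vectorized = (1 - y16) * y17

# For 0/1 data the product is 1 exactly where y16 is 0 and y17 is 1,
# so the same result can be written as a boolean condition.
as_bool = ((y16 == 0) & (y17 == 1)).astype(y16.dtype)

print(np.array_equal(looped, vectorized))   # True
print(np.array_equal(vectorized, as_bool))  # True
```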

Another useful aspect of vectorization is that `numpy` will __broadcast__ arrays together:

In [30]:
import numpy as np
sample = np.random.normal(loc=[4., 20.], scale=[1., 3.5],
                           size=(10, 2))

In [31]:
sample

array([[ 4.12618333, 27.44199882],
       [ 5.93368869, 22.28037346],
       [ 4.74557192, 27.30431394],
       [ 4.63956065, 21.38107758],
       [ 5.02243538, 19.29852652],
       [ 4.55648092, 18.48373739],
       [ 2.475351  , 22.50960304],
       [ 3.86778336, 19.05543055],
       [ 4.52652202, 17.56111298],
       [ 3.50427429, 20.34526696]])

In [32]:
sample.mean(axis=0)

array([ 4.33978515, 21.56614412])

In [33]:
mu = sample.mean(axis=0)

How could we demean this?

In [36]:
sample - mu

array([[-0.21360183,  5.87585469],
       [ 1.59390353,  0.71422934],
       [ 0.40578677,  5.73816981],
       [ 0.29977549, -0.18506654],
       [ 0.68265022, -2.2676176 ],
       [ 0.21669576, -3.08240674],
       [-1.86443416,  0.94345892],
       [-0.47200179, -2.51071357],
       [ 0.18673687, -4.00503115],
       [-0.83551087, -1.22087716]])

Why did this work? The two arrays do not have the same shape!

When an operation involves arrays of different shapes, `numpy` compares the shapes dimension by dimension, starting from the rightmost one. Two dimensions are compatible if they are equal or if one of them is 1 (a missing leading dimension is treated as 1). If every dimension is compatible, `numpy` broadcasts (essentially repeats) the smaller array so that it takes on the shape of the larger one.

In this case, the arrays have shapes (10, 2) and (2,), so the mean was effectively repeated 10 times to give shape (10, 2). Then an element-by-element subtraction was performed.
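The shape rule can be checked directly. The sketch below uses a small deterministic array in place of the random `sample` (an assumption, so the example is reproducible). It shows the broadcast shape, then a case that fails because the rightmost dimensions do not match, and one way to fix it. `np.broadcast_shapes` requires numpy 1.20 or later.

```python
import numpy as np

# Deterministic stand-in for the random sample: 10 rows, 2 columns.
sample = np.arange(20, dtype=float).reshape(10, 2)
mu = sample.mean(axis=0)

print(sample.shape, mu.shape)                       # (10, 2) (2,)
print(np.broadcast_shapes(sample.shape, mu.shape))  # (10, 2)

# Broadcasting behaves as if mu were repeated along the first axis.
explicit = np.broadcast_to(mu, sample.shape)
assert np.array_equal(sample - mu, sample - explicit)

# Row means have shape (10,), which does not line up with the
# rightmost dimension of (10, 2), so numpy refuses to broadcast.
row_means = sample.mean(axis=1)
try:
    sample - row_means
except ValueError as err:
    print("cannot broadcast:", err)

# Adding a trailing axis gives shape (10, 1), which is compatible.
demeaned_rows = sample - row_means[:, np.newaxis]
print(demeaned_rows.shape)  # (10, 2)
```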

This could be used to standardize arrays:

In [37]:
std = sample.std(axis=0)

In [38]:
(sample - mu) / std

array([[-0.24208156,  1.79324883],
       [ 1.80642015,  0.21797526],
       [ 0.45989069,  1.75122886],
       [ 0.33974484, -0.05648036],
       [ 0.7736686 , -0.69205296],
       [ 0.24558801, -0.940718  ],
       [-2.11302087,  0.2879337 ],
       [-0.53493422, -0.76624328],
       [ 0.21163466, -1.22229323],
       [-0.9469103 , -0.37259882]])

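You can do the same with pandas dataframes. Note that pandas `std` uses the sample standard deviation (`ddof=1`) by default, while numpy uses `ddof=0`, so the values differ slightly from the numpy result: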

In [39]:
import pandas as pd
sample_df = pd.DataFrame(sample)

In [40]:
(sample_df - sample_df.mean())/sample_df.std()

,0,1
0,-0.229659,1.701225
1,1.713721,0.206789
2,0.436291,1.661362
3,0.32231,-0.053582
4,0.733966,-0.656539
5,0.232985,-0.892443
6,-2.004588,0.273158
7,-0.507483,-0.726922
8,0.200774,-1.159569
9,-0.898318,-0.353478
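To see the effect of the differing `ddof` defaults directly, here is a minimal sketch. The seeded generator is an assumption made for reproducibility, so the numbers will not match the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sample = rng.normal(loc=[4.0, 20.0], scale=[1.0, 3.5], size=(10, 2))
sample_df = pd.DataFrame(sample)

np_std = sample.std(axis=0)          # numpy default: ddof=0 (divide by n)
pd_std = sample_df.std().to_numpy()  # pandas default: ddof=1 (divide by n - 1)

print(np.allclose(np_std, pd_std))                            # False
print(np.allclose(sample.std(axis=0, ddof=1), pd_std))        # True
print(np.allclose(sample_df.std(ddof=0).to_numpy(), np_std))  # True
```

Passing `ddof` explicitly in both libraries avoids the mismatch when results need to be compared.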


### An aside: some pandas techniques.
We have not yet mentioned all of the ways to do pivot tables and data transformations in pandas.

Let's start with the `unstack` function, the counterpart of `stack`.

In [17]:
import pandas as pd

In [18]:
df = pd.read_csv("panel_example.csv")

In [19]:
df = df.set_index(['i','t'])

Dataframes can use multi-level indices, which are useful for reshaping:

In [20]:
df_wide = df.unstack()

df_wide

,value,value,value,value,value,value,value,value,value,value
t,0,1,2,3,4,5,6,7,8,9
i
0,4.688026,5.522693,3.794489,3.964962,2.41189,3.41173,5.471047,2.224358,3.014295,3.628636
1,20.763822,21.364728,16.473118,22.906417,20.685359,23.98535,17.449606,24.309989,19.684198,19.468067
2,13.34522,13.759424,15.433405,7.599836,12.815756,9.567005,13.707837,5.037831,9.862822,17.529517
3,2.874506,17.500661,21.6032,3.615383,12.067565,12.347808,6.555567,7.524997,-2.819713,-12.429077


So calling `unstack` with no arguments moves the innermost (last) index level, here `t`, into the columns. Setting `level` changes this behavior.

In [21]:
df.unstack(level=0)

,value,value,value,value
i,0,1,2,3
t
0,4.688026,20.763822,13.34522,2.874506
1,5.522693,21.364728,13.759424,17.500661
2,3.794489,16.473118,15.433405,21.6032
3,3.964962,22.906417,7.599836,3.615383
4,2.41189,20.685359,12.815756,12.067565
5,3.41173,23.98535,9.567005,12.347808
6,5.471047,17.449606,13.707837,6.555567
7,2.224358,24.309989,5.037831,7.524997
8,3.014295,19.684198,9.862822,-2.819713
9,3.628636,19.468067,17.529517,-12.429077
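The effect of `level` is easier to follow on a tiny panel. This is a minimal sketch with a made-up two-unit, two-period dataframe (the name `panel` and its values are hypothetical, standing in for `panel_example.csv`).

```python
import pandas as pd

# Hypothetical long-format panel: unit i observed at time t.
panel = pd.DataFrame({
    "i": [0, 0, 1, 1],
    "t": [0, 1, 0, 1],
    "value": [1.0, 2.0, 3.0, 4.0],
}).set_index(["i", "t"])

# The default is level=-1: the innermost index level ("t") moves to the columns.
wide_by_t = panel.unstack()

# Levels can be chosen by name or position; level="i" is the same as level=0.
wide_by_i = panel.unstack(level="i")

print(list(wide_by_t.columns.names))  # [None, 't']
print(list(wide_by_i.columns.names))  # [None, 'i']

# Rows are indexed by whichever level stayed behind.
print(wide_by_t.loc[1, ("value", 0)])  # 3.0  (i=1, t=0)
print(wide_by_i.loc[1, ("value", 0)])  # 2.0  (t=1, i=0)
```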


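The `pivot` function does the same reshaping starting from ordinary columns rather than an index. A fancier version is the `pivot_table` function, which can also compute statistics, aggregating the values when an index/column pair occurs more than once.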

In [53]:
pd.pivot(df.reset_index(),columns="i",values='value',index="t")

i,0,1,2,3
t
0,4.688026,20.763822,13.34522,2.874506
1,5.522693,21.364728,13.759424,17.500661
2,3.794489,16.473118,15.433405,21.6032
3,3.964962,22.906417,7.599836,3.615383
4,2.41189,20.685359,12.815756,12.067565
5,3.41173,23.98535,9.567005,12.347808
6,5.471047,17.449606,13.707837,6.555567
7,2.224358,24.309989,5.037831,7.524997
8,3.014295,19.684198,9.862822,-2.819713
9,3.628636,19.468067,17.529517,-12.429077
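The practical difference between `pivot` and `pivot_table` shows up when an index/column pair appears more than once. Below is a minimal sketch with a made-up dataframe (`long_df` and its values are hypothetical) in which the pair `t=0, i=0` occurs twice.

```python
import pandas as pd

# Hypothetical long-format data with a duplicated (t, i) pair.
long_df = pd.DataFrame({
    "t": [0, 0, 0, 1],
    "i": [0, 0, 1, 0],
    "value": [1.0, 3.0, 5.0, 7.0],
})

# pivot only reshapes, so duplicates are an error.
try:
    long_df.pivot(index="t", columns="i", values="value")
except ValueError as err:
    print("pivot failed:", err)

# pivot_table aggregates duplicates; the default aggfunc is the mean.
means = long_df.pivot_table(index="t", columns="i", values="value")
print(means.loc[0, 0])  # 2.0, the mean of 1.0 and 3.0

# Other statistics are selected with aggfunc.
totals = long_df.pivot_table(index="t", columns="i", values="value", aggfunc="sum")
print(totals.loc[0, 0])  # 4.0
```

Combinations that never occur in the data, here `t=1, i=1`, come back as `NaN` in the `means` table.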


In [54]:
d = pd.read_csv("C:/Users/jhtchns2/classes/naab_example.csv")

In [60]:
d['id'] = d['breed'] + d['country_of_origin'] + d['sire_code'].astype(str)

In [64]:
pd.pivot_table(d,index='id',columns=['period'],values='pta_milk')

period,2000-02,2000-05,2000-08,2000-11,2001-02,2001-05,2001-08,2001-11,2002-02,2002-05,...,2017-08,2017-12,2018-04,2018-08,2018-12,2019-04,2019-08,2019-12,2020-04,2020-08
id
AY84010863,,,,,,,,,,,...,,,,,1054.142857,1054.142857,975.038847,956.241855,954.899273,932.919628
AY8401881,,,,,,,,,,,...,,,,,,,,,,
AY8401885,,,,,,,,,,,...,,,,,,,891.235589,879.487469,768.072290,
AY8401886,,,,,,,,,,,...,1256.993734,1282.056391,1300.070175,1318.867168,1410.502506,1246.028822,1040.828321,1115.233083,1093.371038,840.605119
AY8401887,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
WWUSA9334,,,,,,,,,,,...,,,,,,,,,,
WWUSA9335,,,,,,,,,,970.0,...,,,,,,,,,,
WWUSA9336,,,,,,,,,,,...,,,,,,,,,,
WWUSA9374,,,,,,,,,,,...,,,,,,,,,,


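Because `pivot_table` aggregates (taking the mean by default), the index can also be a coarser grouping such as breed. After a kernel restart, `pd` must be imported and `d` reloaded as above before running this cell:

In [1]:
import pandas as pd  # needed again after a kernel restart; `d` must also be reloaded

pd.pivot_table(d,index='breed',columns=['period'],values='pta_milk')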