# import numpy as np

## Workshop: NumPy and Data Representation

NumPy provides:
  1. An array object of arbitrary homogeneous items
  2. Fast mathematical operations over arrays
  3. Linear Algebra, Fourier Transforms, Random Number Generation
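
The three points above can be tried out in a few lines. This is only a minimal sketch: the four rates are the first values of the daily dataset used later in this workshop, and the variable names are made up for illustration.

```python
import numpy as np

# 1. Homogeneous array: every item shares one dtype
rates = np.array([2.0443, 2.0313, 2.0205, 2.0182])
print(rates.dtype)                     # float64

# 2. Fast element-wise maths without an explicit Python loop
inverse = 1.0 / rates                  # USD per SGD for each entry
mean_rate = rates.mean()

# 3. Linear algebra, Fourier transforms, random number generation
identity = np.linalg.inv(np.eye(2))    # inverse of the 2x2 identity matrix
spectrum = np.fft.fft(rates)           # discrete Fourier transform
rng = np.random.default_rng(seed=0)    # seeded random generator
samples = rng.random(3)                # three floats in [0, 1)
```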

### Cheatsheet

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf

More cheatsheets:
https://www.datacamp.com/community/data-science-cheatsheets?page=3

### References

1. https://docs.scipy.org/doc/numpy-dev/user/basics.types.html
2. Python Data Science Handbook by Jake VanderPlas
3. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython by Wes McKinney

### Example Dataset

Source: data.gov.sg

Dataset: Exchange Rates, SGD per unit of USD

In [153]:
from IPython.display import IFrame

IFrame('https://data.gov.sg/dataset/exchange-rates-sgd-per-unit-of-usd-average-for-period-annual/resource/f927c39b-3b44-492e-8b54-174e775e0d98/view/43207b9f-1554-4afb-98fe-80dfdd6bb4f6', width=600, height=400)

1. Go to https://data.gov.sg/dataset/exchange-rates-sgd-per-unit-of-usd-average-for-period-annual
2. Click on the `Download` button
3. Unzip and extract the `.csv` file. Note the path for use below.

In [155]:
# replace with your path
!dir D:\tmp\exchange-rates

 Volume in drive D is Local Disk
 Volume Serial Number is 4A49-4F1A

 Directory of D:\tmp\exchange-rates

28/05/2018  05:07 PM    <DIR>          .
28/05/2018  05:07 PM    <DIR>          ..
19/04/2017  11:26 PM               374 exchange-rates-sgd-per-unit-of-usd-average-for-period-annual.csv
19/04/2017  11:26 PM            87,418 exchange-rates-sgd-per-unit-of-usd-daily.csv
19/04/2017  11:26 PM             3,051 metadata-exchange-rates-sgd-per-unit-of-usd-average-for-period-annual.txt
               3 File(s)         90,843 bytes
               2 Dir(s)  246,905,012,224 bytes free


### Loading the data

The dataset contains two CSV files. We'll use the daily exchange rates for our NumPy practice.

We'll read the CSV file with Pandas, which will be covered in more detail in the next workshop.

In [217]:
import pandas as pd

# we are using some pandas tricks to parse dates automagically
df = pd.read_csv('D:/tmp/exchange-rates/exchange-rates-sgd-per-unit-of-usd-daily.csv',
                     parse_dates=True, index_col=0, infer_datetime_format=True,
                     squeeze=True)

# inspect the result (with squeeze=True and a single data column, this is a pandas Series)
df

date
1988-01-08    2.0443
1988-01-15    2.0313
1988-01-22    2.0205
1988-01-29    2.0182
1988-02-05    2.0160
1988-02-12    2.0173
1988-02-19    2.0189
1988-02-26    2.0130
1988-03-04    2.0154
1988-03-11    2.0131
1988-03-18    2.0184
1988-03-25    2.0132
1988-03-31    2.0045
1988-04-08    2.0030
1988-04-15    2.0019
1988-04-22    2.0037
1988-04-29    2.0016
1988-05-06    2.0039
1988-05-13    2.0052
1988-05-20    2.0128
1988-05-27    2.0168
1988-06-03    2.0217
1988-06-10    2.0180
1988-06-17    2.0246
1988-06-24    2.0346
1988-07-01    2.0483
1988-07-08    2.0443
1988-07-15    2.0503
1988-07-22    2.0384
1988-07-29    2.0378
               ...  
2015-09-04    1.4183
2015-09-07    1.4249
2015-09-08    1.4262
2015-09-09    1.4140
2015-09-10    1.4212
2015-09-14    1.4104
2015-09-15    1.4011
2015-09-16    1.4008
2015-09-17    1.3988
2015-09-18    1.4003
2015-09-21    1.4050
2015-09-22    1.4128
2015-09-23    1.4219
2015-09-25    1.4243
2015-09-28    1.4259
2015-09-29    1.4319
2015-09-
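
The cell above relies on the `squeeze` and `infer_datetime_format` arguments of `read_csv`. As far as we know, recent pandas releases no longer accept `squeeze` and have deprecated `infer_datetime_format`, so the following is a rough equivalent under that assumption. The CSV text and its header names are hypothetical stand-ins for the downloaded file; the three values are taken from the output above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded CSV; the header names are assumed
csv_text = """date,exchange_rate_usd
1988-01-08,2.0443
1988-01-15,2.0313
1988-01-22,2.0205
"""

# index_col=0 makes the first column the index; parse_dates=True parses that index
frame = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# squeeze the single remaining column into a Series
series = frame.squeeze("columns")
print(type(series).__name__)   # Series
print(series.iloc[0])          # 2.0443
```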

### Import numpy

In [163]:
import numpy as np

In [161]:
# show help
np?

### Basic Data Structures

![scalar vector matrix tensor](assets/numpy/scalar-vector-matrix-tensor.png)

(image: https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/)

### Get the NumPy array from Pandas

In [172]:
# pandas.DataFrame.values returns a numpy array
array = df.values
array

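# note: the array holds a mixture of strings and floating point numbers,
# so it is of dtype=object
# (the output below appears to come from a load that kept the date as a plain string column)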

array([['1988-01-08', 2.0443],
       ['1988-01-15', 2.0313],
       ['1988-01-22', 2.0205],
       ..., 
       ['2015-10-15', 1.3763],
       ['2015-10-16', 1.3834],
       ['2015-10-19', 1.3827]], dtype=object)

In [167]:
# Get the shape (rows, columns)
array.shape

(3993, 2)
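
An object array stores references to arbitrary Python objects, so it loses most of NumPy's speed advantage. One possible way to recover typed columns is sketched below, using a small hand-made array in the same layout as the output above (the name `mixed` is hypothetical).

```python
import numpy as np

# small mixed array in the same layout as the output above
mixed = np.array([['1988-01-08', 2.0443],
                  ['1988-01-15', 2.0313],
                  ['1988-01-22', 2.0205]], dtype=object)
print(mixed.dtype)    # object
print(mixed.shape)    # (3, 2)

# give each column its own specific dtype
dates = np.array(list(mixed[:, 0]), dtype='datetime64[D]')
rates = mixed[:, 1].astype(np.float64)
print(dates.dtype, rates.dtype)   # datetime64[D] float64
```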

### Scalar

A scalar is a single value. 

It is a tensor of rank 0 (0 dimensions).

In [169]:
# indexing the array to get a scalar value
array[0][0]

'1988-01-08'

In [171]:
# another scalar value
array[0][1]

2.0443
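
The "rank 0" claim can be checked directly. A minimal sketch, with hypothetical variable names:

```python
import numpy as np

scalar = np.float64(2.0443)        # a NumPy scalar
zero_d = np.array(2.0443)          # a 0-dimensional array
print(np.ndim(scalar))             # 0
print(zero_d.ndim, zero_d.shape)   # 0 ()
print(zero_d.item())               # 2.0443 as a plain Python float
```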

### Vector

A vector is a list of values. 

It is a tensor of rank 1 (1 dimension).

In [177]:
# indexing the array to get a row vector
first_row = array[0]
first_row

array(['1988-01-08', 2.0443], dtype=object)

In [181]:
first_row.shape

(2,)

In [183]:
# get the rank
first_row.ndim

1

In [179]:
# indexing the array to get a column as a vector
years = array[:,0] # column of all the dates
years

array(['1988-01-08', '1988-01-15', '1988-01-22', ..., '2015-10-15',
       '2015-10-16', '2015-10-19'], dtype=object)

In [180]:
years.shape

(3993,)
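
Note that both vectors above have a one-element shape such as `(3993,)`: a 1-D NumPy array has no row or column orientation. If an explicit row or column is needed, `reshape` can add the second axis. A small sketch with hypothetical names:

```python
import numpy as np

v = np.array([2.0443, 2.0313, 2.0205])
print(v.shape)                 # (3,)

row = v.reshape(1, -1)         # explicit row vector, 2-D
col = v.reshape(-1, 1)         # explicit column vector, 2-D
print(row.shape, col.shape)    # (1, 3) (3, 1)

# transposing a 1-D array changes nothing
print(v.T.shape)               # (3,)
```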

### Matrix

A matrix is a 2-dimensional array of values.

It is a tensor of rank 2.

In [184]:
array.ndim

2

### Tensor (rank $\geq$ 3)

We can create a tensor by stacking matrices along a third axis.
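
As a minimal sketch of the idea, `np.stack` joins equally shaped arrays along a new axis. The two matrices here are made-up numbers, not values from the datasets.

```python
import numpy as np

# two hypothetical 2x3 matrices (made-up numbers)
usd = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])
cny = usd * 10

tensor = np.stack([usd, cny], axis=2)   # the new axis goes last
print(tensor.shape)                     # (2, 3, 2)
print(tensor.ndim)                      # 3
print(tensor[0, 1, :])                  # the values 2.0 and 20.0, one per currency
```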

Let's say we also have daily exchange rates between the Singapore Dollar and the Renminbi, and we would like to use currency as a third axis.

(data source: https://www.exchangerates.org.uk)

In [216]:
df_rmb = pd.read_csv('data/sgd_cny_rates_daily.csv',
                     parse_dates=True, index_col=0, infer_datetime_format=True,
                     squeeze=True)
df_rmb

Date
2018-05-27    4.7499
2018-05-26    4.7620
2018-05-25    4.7610
2018-05-24    4.7618
2018-05-23    4.7553
2018-05-22    4.7554
2018-05-21    4.7657
2018-05-20    4.7544
2018-05-19    4.7500
2018-05-18    4.7497
2018-05-17    4.7426
2018-05-16    4.7558
2018-05-15    4.7410
2018-05-14    4.7440
2018-05-13    4.7547
2018-05-12    4.7399
2018-05-11    4.7330
2018-05-10    4.7472
2018-05-09    4.7256
2018-05-08    4.7523
2018-05-07    4.7662
2018-05-06    4.7669
2018-05-05    4.7666
2018-05-04    4.7668
2018-05-03    4.7743
2018-05-02    4.7634
2018-05-01    4.7579
2018-04-30    4.7764
2018-04-29    4.7864
2018-04-28    4.7842
               ...  
2009-11-04    4.8842
2009-11-03    4.8768
2009-11-02    4.8791
2009-11-01    4.8748
2009-10-31    4.8748
2009-10-30    4.8838
2009-10-29    4.8788
2009-10-28    4.8742
2009-10-27    4.8869
2009-10-26    4.8957
2009-10-25    4.8995
2009-10-24    4.8995
2009-10-23    4.9017
2009-10-22    4.8969
2009-10-21    4.9015
2009-10-20    4.9037
2009-10-

In [208]:
array_rmb = df_rmb.values
array_rmb

array([ 4.7499,  4.762 ,  4.761 , ...,  4.9082,  4.8754,  4.8734])

### Slicing

But wait, the date ranges don't line up. 

To fix this, we'll need to slice both arrays appropriately.

First, we need to find out the overlapping date ranges.

### Aside: Working with dates

[numpy.datetime64](https://docs.scipy.org/doc/numpy-1.14.0/reference/arrays.datetime.html) supports manipulating dates and times.  
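
A short sketch of what `datetime64` allows: comparison, arithmetic with `timedelta64`, and date ranges via `np.arange`. The two dates are the first and last dates of the SGD-USD series; the variable names are hypothetical.

```python
import numpy as np

start = np.datetime64('1988-01-08')
end = np.datetime64('2015-10-19')

print(start < end)                      # True
span = end - start                      # a numpy.timedelta64 measured in days
print(start + np.timedelta64(7, 'D'))   # 1988-01-15

# an array of consecutive days (the stop date is excluded)
week = np.arange('1988-01-08', '1988-01-15', dtype='datetime64[D]')
print(week.shape)                       # (7,)
```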

In [196]:
# Date range for the SGD-USD series
np.datetime64(array[0][0]) # first date

numpy.datetime64('1988-01-08')

In [198]:
# Date range for the SGD-USD series
np.datetime64(array[-1][0]) # last date

numpy.datetime64('2015-10-19')

In [203]:
# Date range for the SGD-CNY series (sorted newest first, so index[0] is the last date)
np.datetime64(df_rmb.index[0])

numpy.datetime64('2018-05-27T00:00:00.000000')

In [204]:
# Date range for the SGD-CNY series (sorted newest first, so index[-1] is the first date)
np.datetime64(df_rmb.index[-1])

numpy.datetime64('2009-10-06T00:00:00.000000')

In [206]:
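# Comparing dates: datetime64 values work with max() and min()
# note: df_rmb is sorted newest first, so df_rmb.index[0] is its latest date, not its start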
max(np.datetime64(array[0][0]), np.datetime64(df_rmb.index[0]))

numpy.datetime64('2018-05-27T00:00:00.000000')
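
Because one series is sorted oldest first and the other newest first, comparing first elements does not give the start of the overlap. One way to sidestep the ordering is to compare each array's `min()` and `max()`. The sketch below uses two small hypothetical date arrays rather than the real series.

```python
import numpy as np

# hypothetical daily date arrays; the second is in descending order like df_rmb
dates_a = np.arange('2020-01-01', '2020-01-11', dtype='datetime64[D]')        # ascending
dates_b = np.arange('2020-01-06', '2020-01-16', dtype='datetime64[D]')[::-1]  # descending

# use min()/max() instead of the first/last element, so ordering does not matter
overlap_start = max(dates_a.min(), dates_b.min())
overlap_end = min(dates_a.max(), dates_b.max())
print(overlap_start, overlap_end)   # 2020-01-06 2020-01-10

# boolean masks select the overlapping part of each array
mask_a = (dates_a >= overlap_start) & (dates_a <= overlap_end)
mask_b = (dates_b >= overlap_start) & (dates_b <= overlap_end)
print(dates_a[mask_a].shape, dates_b[mask_b].shape)   # (5,) (5,)
```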

### Slicing

Now that we've determined our ranges, we can slice both series to the overlapping dates.


### Data Structure Manipulation

In [None]:
# Array documentation
# (numpy.doc ships with older NumPy releases; if this import fails, use the online user guide)
from numpy import doc

# Array types and conversions, scalars
doc.basics?

In [None]:
# Array indexing and slicing
doc.indexing?

### Indexing

In [None]:
A = np.random.random((3, 2))
A

In [None]:
A[0] # 1st row

In [None]:
A[1][0] # 2nd row, 1st column

In [None]:
A[-2][-1] # second-last row, last column

In [None]:
A[-4] # out of bounds access: raises IndexError (A has only 3 rows)

In [None]:
A[A>0.5] # boolean indexing
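
Boolean indexing works in two steps: the comparison builds a boolean mask, and the mask then picks out elements. A minimal sketch with a fixed, hypothetical array `B` so the result is reproducible:

```python
import numpy as np

B = np.array([[0.1, 0.9],
              [0.6, 0.3],
              [0.8, 0.2]])   # fixed values instead of random ones

mask = B > 0.5               # boolean array with the same shape as B
print(mask.shape)            # (3, 2)

selected = B[mask]           # 1-D array of the elements where mask is True
print(selected)              # [0.9 0.6 0.8]

B[B > 0.5] = 0.5             # a mask can also be used for assignment
print(B.max())               # 0.5
```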

Exercise: Try the above with a vector and with a rank-3 tensor.

### Subsetting

`A[index, ...]`

Use one index per axis. An index can be `:` to select everything along that axis (e.g. `A[1, :]`).

In [132]:
A

array([[ 0.1819358 ,  0.07922454],
       [ 0.95426202,  0.34311652],
       [ 0.83740228,  0.24108659]])

In [None]:
A[1, 0] # 2nd row, 1st column

In [None]:
A[:,1] # 2nd column

In [None]:
A[0,:] # 1st row
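
A quick check of how the comma form relates to chained indexing, and of what happens to the shape. This sketch uses a small integer array so the values are easy to follow; it reuses the name `A` only for illustration.

```python
import numpy as np

A = np.arange(6).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]

print(A[1, 0] == A[1][0])        # True: both pick the 2nd row, 1st column
print(A[:, 1])                   # [1 3 5], the 2nd column
print(A[:, 1].shape)             # (3,)   an integer index drops that axis
print(A[:, 1:2].shape)           # (3, 1) a slice keeps the axis
```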

### Slicing

`data[start : stop : stepsize]`

`start` is included, `stop` is excluded, and `stepsize` defaults to 1.
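
The slice notation is easiest to see on a small 1-D array. A minimal sketch, where `data` is a hypothetical array:

```python
import numpy as np

data = np.arange(10)     # [0 1 2 3 4 5 6 7 8 9]

print(data[2:8:2])       # [2 4 6]  start=2, stop=8 (excluded), step=2
print(data[:3])          # [0 1 2]  an omitted start means the beginning
print(data[7:])          # [7 8 9]  an omitted stop runs to the end
print(data[::-1][:3])    # [9 8 7]  a negative step reverses the order
```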

In [133]:
R = np.linspace(1, 10, 24).reshape(4, 3, 2)
R

array([[[  1.        ,   1.39130435],
        [  1.7826087 ,   2.17391304],
        [  2.56521739,   2.95652174]],

       [[  3.34782609,   3.73913043],
        [  4.13043478,   4.52173913],
        [  4.91304348,   5.30434783]],

       [[  5.69565217,   6.08695652],
        [  6.47826087,   6.86956522],
        [  7.26086957,   7.65217391]],

       [[  8.04347826,   8.43478261],
        [  8.82608696,   9.2173913 ],
        [  9.60869565,  10.        ]]])

In [None]:
R[1:2:1] # 2nd entry along axis=0, result shape (1, 3, 2)

In [None]:
R[:, 1:2:1,] # 2nd entry along axis=1, result shape (4, 1, 2)

In [None]:
R[:, :, 1:2:1] # 2nd entry along axis=2, result shape (4, 3, 1)

In [134]:
R[::2,] # every other row along axis=0

array([[[ 1.        ,  1.39130435],
        [ 1.7826087 ,  2.17391304],
        [ 2.56521739,  2.95652174]],

       [[ 5.69565217,  6.08695652],
        [ 6.47826087,  6.86956522],
        [ 7.26086957,  7.65217391]]])

### Slices are views, not copies

Modifying a slice changes the original array, because the slice shares the same underlying data.

In [147]:
R2 = R[0][1:2:1] # same as R[0, 1:2:1]
R2

array([[ 1.7826087 ,  2.17391304]])

In [150]:
R2[0] = 0
R2

array([[ 0.,  0.]])

In [151]:
R

array([[[  1.        ,   1.39130435],
        [  0.        ,   0.        ],
        [  2.56521739,   2.95652174]],

       [[  3.34782609,   3.73913043],
        [  4.13043478,   4.52173913],
        [  4.91304348,   5.30434783]],

       [[  5.69565217,   6.08695652],
        [  6.47826087,   6.86956522],
        [  7.26086957,   7.65217391]],

       [[  8.04347826,   8.43478261],
        [  8.82608696,   9.2173913 ],
        [  9.60869565,  10.        ]]])
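
When the original must stay untouched, an explicit `.copy()` breaks the link. The sketch below rebuilds `R` as defined earlier and uses `np.shares_memory` to check whether two arrays share data; `view` and `safe` are hypothetical names.

```python
import numpy as np

R = np.linspace(1, 10, 24).reshape(4, 3, 2)

view = R[0, 1:2]                     # basic slicing returns a view
print(np.shares_memory(R, view))     # True

safe = R[0, 1:2].copy()              # an explicit copy owns its own data
safe[:] = 0
print(R[0, 1, 0] != 0)               # True: R is untouched by changes to the copy

view[:] = 0                          # writes through to R
print(R[0, 1])                       # [0. 0.]
```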

### Transposing

In [None]:
R.T

In [None]:
R.T.shape
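
For an array with more than two dimensions, `.T` reverses the order of all axes. To reorder specific axes, `np.transpose` takes an `axes` argument. A minimal sketch, rebuilding `R` as defined earlier (`swapped` is a hypothetical name):

```python
import numpy as np

R = np.linspace(1, 10, 24).reshape(4, 3, 2)

print(R.T.shape)                            # (2, 3, 4): the axes are reversed
print(R.T[1, 2, 3] == R[3, 2, 1])           # True

swapped = np.transpose(R, axes=(0, 2, 1))   # swap only the last two axes
print(swapped.shape)                        # (4, 2, 3)
print(np.shares_memory(R, R.T))             # True: a transpose is also a view
```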

### Sorting

# import pandas as pd

## Workshop: Pandas and Data Transformation

# import matplotlib.pyplot as plt

## Workshop: Matplotlib and Data Visualization

# Putting everything together

## Workshop: Data Workflow

## Assessment 1: Data Workflow