# Scipy Basics

In this notebook we'll go over the basics of `scipy`, which provides useful scientific computing functions.

## 1. Importing scipy 
The first thing we do in every Python script is import our libraries. Since we're talking about scipy today, we'll import scipy. We will also import numpy, as it is a useful base package for many scipy functions.

One useful thing we can do is print out all the subroutines and submodules within the module to give us an overview of its capabilities.

In [1]:
import numpy as np
import scipy
#print("\n".join(scipy.__dir__())) # shows available objects, functions and submodules. Uncomment and execute to see full list.
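
If the full `__dir__()` listing is too noisy, a minimal sketch like the one below lists only the public submodules. It assumes scipy is installed as a regular package, and it uses the standard-library `pkgutil` module, which isn't used elsewhere in this notebook.

```python
import pkgutil

import scipy

# Walk the scipy package directory and keep only public names
# (names starting with an underscore are private implementation details).
submodules = sorted(
    info.name
    for info in pkgutil.iter_modules(scipy.__path__)
    if not info.name.startswith("_")
)
print("\n".join(submodules))
```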

## 2. Scientific computing vs Numerical computing?

Scipy provides scientific computing functions. But what is the difference between scientific computing and numerical computing?

One may notice that `numpy` and `scipy` have overlapping functionality. This overlap was historically exposed through the `numpy.dual` package (deprecated as of NumPy 1.20). Some of the overlapping functions are related to the `numpy.linalg`/`scipy.linalg` and `numpy.fft`/`scipy.fft` packages.

The `scipy` implementations of these often use different underlying algorithms, which can give shorter runtimes than the `numpy` versions, particularly for large arrays.

You can look at the source code for the differences in implementations, but the main thing you want to think about is how large your arrays are and whether or not you care about runtime.

Let's look at how array size affects runtime for numpy and scipy with the example below:

In [2]:
from scipy.fft import fft as sci_fft
from numpy.fft import fft as num_fft

import numpy as np
from time import time


def test_fft(array_size=10):
  arr = np.random.randn(array_size)

  time_start = time()
  arr_fft_num = num_fft(arr)
  time_end = time()
  time_numpy_fft = time_end - time_start

  time_start = time()
  arr_fft_sci = sci_fft(arr)
  time_end = time()
  time_scipy_fft = time_end - time_start

  # uncomment below to print arrays and their difference for sanity
  # print(arr_fft_num, arr_fft_sci, np.sum(arr_fft_num - arr_fft_sci), sep='\n')

  # print time
  print("For array of size {}".format(array_size),
        "Numpy time: {}".format(time_numpy_fft),
        "Scipy time: {}".format(time_scipy_fft), sep='\n')
for i in range(1, 18, 3):
  test_fft(int(np.e**i))

For array of size 2
Numpy time: 0.00017309188842773438
Scipy time: 0.001257181167602539
For array of size 54
Numpy time: 3.981590270996094e-05
Scipy time: 5.507469177246094e-05
For array of size 1096
Numpy time: 0.0001633167266845703
Scipy time: 0.0001761913299560547
For array of size 22026
Numpy time: 0.00786590576171875
Scipy time: 0.0050907135009765625
For array of size 442413
Numpy time: 0.16072320938110352
Scipy time: 0.15365195274353027
For array of size 8886110
Numpy time: 3.8905670642852783
Scipy time: 3.5051729679107666


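In this run, numpy is faster for the smallest arrays, while scipy is slightly faster from roughly 22,000 elements upward. The difference for a single call is small, but if we do this enough times, we end up adding significant runtime to our program. Keep in mind that each number above comes from a single call, so the results will vary between runs and machines.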
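
Single calls timed with `time()` are noisy, so a more robust comparison repeats each call and keeps the best result. The sketch below does this with the standard-library `timeit` module. The helper `best_time`, the array size, and the repeat counts are all illustrative choices, not part of the original benchmark.

```python
import timeit

import numpy as np
import scipy.fft


def best_time(func, arr, repeat=5, number=10):
    """Best average time per call (seconds) over `repeat` batches of `number` calls."""
    batch_times = timeit.repeat(lambda: func(arr), repeat=repeat, number=number)
    return min(batch_times) / number


rng = np.random.default_rng(0)    # fixed seed so the input is reproducible
arr = rng.standard_normal(4096)   # array size chosen arbitrarily for illustration

time_numpy = best_time(np.fft.fft, arr)
time_scipy = best_time(scipy.fft.fft, arr)
print("Numpy best time: {:.3e} s".format(time_numpy),
      "Scipy best time: {:.3e} s".format(time_scipy), sep='\n')
```

Taking the minimum over several batches filters out most interference from other processes, which is why it is a common choice for micro-benchmarks.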

## 3. Scipy Overview

Scipy has many internal submodules that provide different functionality. I'll group them into beginner, intermediate, and advanced based on their use cases:

### **Beginner**
#### `scipy.linalg`
Contains all `numpy.linalg` linear algebra functions, plus some more advanced ones.

#### `scipy.fft`
Contains functions for discrete Fourier transforms. Like `scipy.linalg`, this has a large overlap with the `numpy.fft` discrete Fourier transform functions.

#### `scipy.stats`
Contains statistical functions, such as continuous, multivariate, and discrete distributions, summary and frequency statistics, correlation functions, statistical tests, and more.


### **Intermediate**
#### `scipy.ndimage`
Contains functionality for multidimensional image processing, including filters, interpolation, measurements and morphology.

#### `scipy.io`
Contains functionality for loading and saving files in a variety of formats used by external systems, including MATLAB, IDL, Matrix Market, Fortran, NetCDF, Harwell-Boeing, WAV sound files, and ARFF files.

#### `scipy.signal`
Contains signal processing functions, including convolution, correlation, spline creation, filter design and filtering functions, wavelets and waveforms, peak finding, and spectral analysis.

#### `scipy.interpolate`
Contains functionality for interpolation of N-dimensional objects, as well as tools for creating 1D and 2D splines, and a few advanced functions, such as Lagrange interpolating polynomials, Taylor polynomial estimations and Padé approximations.

### **Advanced**
#### `scipy.integrate`
Contains functions for integrating over single or multiple variables as well as ordinary differential equation solvers.
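
As a quick taste, here is a minimal sketch using `integrate.quad` for a definite integral and `integrate.solve_ivp` for an initial value problem. The integrand and the differential equation are toy problems picked for illustration because their exact answers are known.

```python
import numpy as np
from scipy import integrate

# Definite integral of sin(x) from 0 to pi; the exact answer is 2.
area, abs_error_estimate = integrate.quad(np.sin, 0, np.pi)

# Initial value problem dy/dt = -y with y(0) = 1; the exact solution is exp(-t).
solution = integrate.solve_ivp(lambda t, y: -y, t_span=(0, 1), y0=[1.0])
y_at_1 = solution.y[0, -1]  # value of y at the final time t = 1

print(area, y_at_1, np.exp(-1), sep='\n')
```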

#### `scipy.optimize`
Contains functions for optimizing objective functions, least-squares and curve fitting, root finding, and linear programming.
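
For a rough idea of what this looks like in practice, the sketch below minimizes a one-variable function with `optimize.minimize_scalar` and finds a root with `optimize.brentq`. Both objective functions are toy examples chosen here because their answers are known.

```python
from scipy import optimize

# Minimum of a simple parabola; the exact minimizer is x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)

# Root of x**3 - 1 inside the bracket [0, 2]; the exact root is x = 1.
root = optimize.brentq(lambda x: x ** 3 - 1.0, 0.0, 2.0)

print(result.x, root, sep='\n')
```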

#### `scipy.spatial`
Contains functions for spatial algorithms as well as data structures for these algorithms.

#### `scipy.sparse`
Contains functions for sparse data processing, including sparse matrix classes and functions, compressed sparse graph routines, and sparse eigenvalue problems.
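
To make the idea concrete, here is a small sketch that stores a mostly-zero matrix in compressed sparse row (CSR) format and multiplies it by a vector. The matrix and vector are made-up illustration data.

```python
import numpy as np
from scipy import sparse

dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 0, 0]])
sp_mat = sparse.csr_matrix(dense)  # CSR format stores only the nonzero entries
vec = np.array([1, 2, 3])

print(sp_mat.nnz)        # number of stored nonzero entries
print(sp_mat @ vec)      # sparse matrix-vector product
print(sp_mat.toarray())  # convert back to a dense array
```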

#### `scipy.special`
Contains "special" functions, including many functions for mathematical physics as well as low level statistical functions.


## 4. Scipy Usage Examples

Finally, let's get into some basic examples from some of the packages we see above. I won't get into any advanced packages today, but they are easy to navigate using both the [numpy and scipy documentation](https://docs.scipy.org/doc/).

### Linear Algebra

#### `scipy` vs `numpy` for `linalg`
As we discussed earlier, scipy contains all the linear algebra functionality that numpy provides, plus more. Furthermore, `scipy.linalg` exposes low-level [BLAS/LAPACK](http://www.netlib.org/blas/) routines through `scipy.linalg.blas` and `scipy.linalg.lapack`, if you need them.

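Let's do some basic linear algebra on both `numpy.ndarray` and `numpy.matrix` objects. Note that the numpy documentation no longer recommends `numpy.matrix` for new code; it is shown here for comparison.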

In [3]:
from scipy import linalg
# Define A and b for Ax=b
arr = np.random.randn(5, 5)  # 5x5 matrix as an array (A for Ax=b)
mat = np.asmatrix(arr)       # same data as a numpy.matrix (np.mat was removed in NumPy 2.0)
b = np.random.randn(5, 1)    # 5x1 column vector (b for Ax=b)
print(arr, mat, sep='\n')

[[-1.31090068  2.01393695  0.11428649  2.28762634  0.48636024]
 [ 1.17866815 -0.10991722 -0.26588125  1.39131198  0.69651684]
 [ 0.24109037  0.13141605  1.57989829  0.13784808 -2.23976287]
 [ 1.2235437   0.73344705  0.21991166 -0.18915872 -0.16787167]
 [ 1.79559633 -0.1962431  -0.51489064  0.35316694  0.19478688]]
[[-1.31090068  2.01393695  0.11428649  2.28762634  0.48636024]
 [ 1.17866815 -0.10991722 -0.26588125  1.39131198  0.69651684]
 [ 0.24109037  0.13141605  1.57989829  0.13784808 -2.23976287]
 [ 1.2235437   0.73344705  0.21991166 -0.18915872 -0.16787167]
 [ 1.79559633 -0.1962431  -0.51489064  0.35316694  0.19478688]]


#### Matrix Properties
Compute the matrix determinant and norm. For a 2-D input, `linalg.norm` returns the Frobenius norm by default.

In [4]:
det = linalg.det(mat) # matrix determinant
norm = linalg.norm(mat) # matrix norm
print(det, norm, sep='\n')

-8.181480538077606
5.347425670921321
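
Since the matrix above is random, it can help to check the same calls on a matrix small enough to verify by hand. The 2x2 matrix below is an illustrative choice, not from the cell above.

```python
import numpy as np
from scipy import linalg

small = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

det_small = linalg.det(small)              # 1*4 - 2*3 = -2
fro_norm = linalg.norm(small)              # default for 2-D input: Frobenius norm
fro_by_hand = np.sqrt(np.sum(small ** 2))  # sqrt(1 + 4 + 9 + 16) = sqrt(30)
inf_norm = linalg.norm(small, np.inf)      # maximum absolute row sum: |3| + |4| = 7

print(det_small, fro_norm, fro_by_hand, inf_norm, sep='\n')
```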


#### Matrix inverse and Linear systems solving

We can use `scipy.linalg.inv` to calculate the inverse of an array.

In [5]:
inverse_arr = linalg.inv(arr) # inverts a square 2-D array
print(inverse_arr)

[[-0.11318836  0.21125138  0.02150984  0.36168544  0.08626787]
 [ 0.31370956 -0.49300143 -0.10383771  0.50607634  0.22173965]
 [-0.51035728  1.50240601  0.09607909  1.0507211  -2.08768122]
 [ 0.19426452  0.25271172  0.1779296  -0.45694834  0.26342078]
 [-0.34182001  1.06914304 -0.37152937  0.78166688 -1.43411318]]


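Given a system `A x = b`, we can solve it by multiplying by the inverse with `Ai.dot(b)` (where `Ai` is the matrix inverse), or directly with `np.linalg.solve`. The direct solve avoids forming the inverse and is usually the quicker choice for larger systems, although single-run timings like the ones below are noisy. Note that in the output below, "Scipy" labels the inverse-based solution and "Numpy" labels `np.linalg.solve`: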

In [6]:
def test_solver(sz):
  sz = (sz, sz)
  mat = np.asmatrix(np.random.randn(*sz))  # np.mat was removed in NumPy 2.0
  b = np.random.randn(sz[0], 1)
  time_start = time()
  inverse_mat = linalg.inv(mat) # inverts a matrix. same as above.
  x_scipy = inverse_mat.dot(b)
  time_end = time()
  time_scipy = time_end-time_start

  time_start = time()
  x_numpy = np.linalg.solve(mat, b)
  time_end = time()
  time_numpy = time_end-time_start

  error_scipy = sum(mat.dot(x_scipy) - b)
  error_numpy = sum(mat.dot(x_numpy) - b)
  print('Size: {}'.format(sz), 'Error Scipy: {}'.format(error_scipy), 'Error Numpy: {}'.format(error_numpy), 
    'Time Scipy: {}'.format(time_scipy), 'Time Numpy: {}'.format(time_numpy), sep='\n')
  
for i in range(5):
  test_solver((5**i))

Size: (1, 1)
Error Scipy: [[-2.22044605e-16]]
Error Numpy: [[0.]]
Time Scipy: 0.0003941059112548828
Time Numpy: 0.0014955997467041016
Size: (5, 5)
Error Scipy: [[-1.22124533e-15]]
Error Numpy: [[7.77156117e-16]]
Time Scipy: 0.00010204315185546875
Time Numpy: 5.14984130859375e-05
Size: (25, 25)
Error Scipy: [[-4.21954138e-14]]
Error Numpy: [[1.02529096e-13]]
Time Scipy: 0.0005040168762207031
Time Numpy: 0.010735750198364258
Size: (125, 125)
Error Scipy: [[3.34889885e-13]]
Error Numpy: [[-4.27848946e-14]]
Time Scipy: 0.0067043304443359375
Time Numpy: 0.000701904296875
Size: (625, 625)
Error Scipy: [[-3.21744871e-10]]
Error Numpy: [[2.48987522e-11]]
Time Scipy: 0.04827094078063965
Time Numpy: 0.02065277099609375
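
The errors printed above are signed sums of the residual, so positive and negative entries can cancel. A sketch of a stricter check, using the residual norm instead, is shown below. The matrix size, the seed, and the diagonal shift (added to keep the random matrix well conditioned) are assumptions made for this illustration.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
n = 200                                          # size chosen for illustration
A = rng.standard_normal((n, n)) + n * np.eye(n)  # diagonal shift keeps A well conditioned
b = rng.standard_normal((n, 1))

x_inv = linalg.inv(A) @ b     # solve via the explicit inverse
x_solve = linalg.solve(A, b)  # solve directly, without forming the inverse

# The norm of the residual A x - b cannot hide errors through cancellation.
residual_inv = linalg.norm(A @ x_inv - b)
residual_solve = linalg.norm(A @ x_solve - b)
print('Residual (inverse): {}'.format(residual_inv),
      'Residual (solve): {}'.format(residual_solve), sep='\n')
```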


#### Decompositions

Solving for Eigenvalues and Eigenvectors

In [7]:
mat = np.asmatrix(np.random.randn(5, 5))  # np.mat was removed in NumPy 2.0
eigv = linalg.eigvals(mat) # find eigenvalues
_, eigr = linalg.eig(mat) # by default returns eigenvalues and right eigenvectors
_, eigl, eigr2 = linalg.eig(mat, left=True, right=True) # find eigenvalues and left/right eigenvectors
assert np.allclose(eigr, eigr2) # make sure the right eigenvectors match
print(eigv, eigl, eigr, sep='\n')

[ 1.06958159+0.54763543j  1.06958159-0.54763543j -0.45171776+1.1345995j
 -0.45171776-1.1345995j  -0.86301768+0.j        ]
[[-0.36737622-0.22368159j -0.36737622+0.22368159j -0.08105919+0.52118366j
  -0.08105919-0.52118366j -0.08369183+0.j        ]
 [ 0.67848587+0.j          0.67848587-0.j         -0.23972077+0.51343418j
  -0.23972077-0.51343418j  0.55227929+0.j        ]
 [-0.06725205-0.04245401j -0.06725205+0.04245401j  0.5764616 +0.j
   0.5764616 -0.j          0.03702611+0.j        ]
 [-0.40693059+0.33404544j -0.40693059-0.33404544j  0.09228142-0.1189569j
   0.09228142+0.1189569j   0.78750328+0.j        ]
 [ 0.15556038-0.21669126j  0.15556038+0.21669126j  0.04278538-0.20954983j
   0.04278538+0.20954983j  0.25778076+0.j        ]]
[[ 0.38018309-0.25936132j  0.38018309+0.25936132j -0.00895911-0.34898333j
  -0.00895911+0.34898333j  0.165099  +0.j        ]
 [-0.0909622 +0.22018821j -0.0909622 -0.22018821j -0.10808725-0.2663533j
  -0.10808725+0.2663533j   0.27191942+0.j        ]
 [-0.0189589
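
To convince ourselves that the returned vectors really are eigenvectors, we can plug them back into the defining equations. The sketch below does this on a plain `ndarray` with a fixed seed (both are illustrative choices). The left-eigenvector relation used is the one documented for `scipy.linalg.eig`.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

w, vl, vr = linalg.eig(A, left=True, right=True)

# Right eigenvectors are the columns of vr: A @ vr[:, i] == w[i] * vr[:, i]
right_ok = np.allclose(A @ vr, vr * w)

# Left eigenvectors are the columns of vl:
# A.conj().T @ vl[:, i] == w[i].conj() * vl[:, i]
left_ok = np.allclose(A.conj().T @ vl, vl * w.conj())

print(right_ok, left_ok)
```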

### Discrete Fourier Transforms

In [8]:
from scipy import fft

Fourier Transform of a one dimensional sequence using both numpy and scipy:

In [9]:
def test_fft(length_):
  arr = np.random.randn(length_)

  time_start = time()
  fft_np = np.fft.fft(arr) 
  time_end = time()
  time_np = time_end-time_start

  time_start = time()
  fft_scipy = fft.fft(arr) 
  time_end = time()
  time_scipy = time_end-time_start

  print('Size: {}; norm of difference: {}'.format(length_, linalg.norm(fft_np-fft_scipy)), 
  'Time Numpy: {}'.format(time_np), 'Time Scipy: {}'.format(time_scipy), sep='\n')

for i in range(10):
  test_fft(25*(5**i))

Size: 25; norm of difference: 2.8543745438774784e-15
Time Numpy: 4.744529724121094e-05
Time Scipy: 7.271766662597656e-05
Size: 125; norm of difference: 3.044036367711843e-14
Time Numpy: 4.100799560546875e-05
Time Scipy: 7.62939453125e-05
Size: 625; norm of difference: 1.7966317663579881e-13
Time Numpy: 3.838539123535156e-05
Time Scipy: 5.364418029785156e-05
Size: 3125; norm of difference: 1.1186169367865886e-12
Time Numpy: 0.00014138221740722656
Time Scipy: 0.00012540817260742188
Size: 15625; norm of difference: 5.961121895273564e-12
Time Numpy: 0.0007188320159912109
Time Scipy: 0.0005805492401123047
Size: 78125; norm of difference: 3.325937012525871e-11
Time Numpy: 0.003635406494140625
Time Scipy: 0.0032110214233398438
Size: 390625; norm of difference: 1.8077805693016476e-10
Time Numpy: 0.014808416366577148
Time Scipy: 0.01246500015258789
Size: 1953125; norm of difference: 9.699996956506077e-10
Time Numpy: 0.09759044647216797
Time Scipy: 0.06934094429016113
Size: 9765625; norm of diff
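
Beyond timing, two properties are worth checking for yourself: the inverse transform recovers the input, and for real input `rfft` returns only the non-negative-frequency half of the spectrum. A minimal sketch, with an arbitrary signal length and seed:

```python
import numpy as np
from scipy import fft

rng = np.random.default_rng(0)
n = 1024
signal = rng.standard_normal(n)

spectrum = fft.fft(signal)
recovered = fft.ifft(spectrum)    # complex output; imaginary part is ~0 for real input
half_spectrum = fft.rfft(signal)  # only the n//2 + 1 non-negative frequencies

print(np.allclose(recovered.real, signal))
print(half_spectrum.shape)
print(np.allclose(half_spectrum, spectrum[:n // 2 + 1]))
```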

### Statistical Functions

Cumulative distribution function

In [10]:
from scipy.stats import norm
from scipy import stats

# Generate an array of sorted random values
arr = np.random.randn(125)
arr.sort()
arr_cdf = norm.cdf(arr) # evaluates the standard normal CDF at each value,
                        # i.e. the fraction of the distribution lying
                        # at or below that value.

print(arr_cdf)

[0.03270753 0.03584904 0.03865498 0.05431399 0.06726163 0.07025335
 0.07812887 0.09306605 0.09575427 0.09744055 0.10154664 0.10194899
 0.10539715 0.10819319 0.10872254 0.11111219 0.11721703 0.11939157
 0.1340295  0.13502102 0.14823795 0.1540962  0.17819805 0.18393771
 0.18441321 0.18737112 0.18857169 0.18860746 0.19737631 0.23544284
 0.24756228 0.25258334 0.25838779 0.28423261 0.30041701 0.30064384
 0.3083243  0.3091222  0.33155045 0.36935548 0.3708499  0.37324095
 0.38701772 0.39306052 0.40032645 0.40065381 0.41094193 0.41221711
 0.41850511 0.43463176 0.43572009 0.45120727 0.451233   0.45167989
 0.4646543  0.47113994 0.47130255 0.47299229 0.4794256  0.48384792
 0.49079963 0.51202634 0.5150574  0.51567384 0.52395845 0.54710696
 0.55903115 0.56240765 0.56693833 0.56910564 0.5714272  0.57553497
 0.57673122 0.57894446 0.59527702 0.59706588 0.60449517 0.62203622
 0.62843081 0.62879244 0.62947782 0.63284395 0.64624964 0.66616851
 0.66910575 0.67476655 0.68390726 0.69938653 0.70946308 0.7107
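
A few spot checks make the meaning of `norm.cdf` clearer. The sketch below uses hand-picked values (not the random array above), along with `norm.ppf`, the inverse of the CDF, and the `loc`/`scale` arguments for a non-standard normal.

```python
import numpy as np
from scipy.stats import norm

cdf_at_mean = norm.cdf(0.0)  # half of the distribution lies below the mean
cdf_at_196 = norm.cdf(1.96)  # roughly 0.975

values = np.array([-1.5, 0.0, 0.3, 2.0])
round_trip = norm.ppf(norm.cdf(values))  # ppf undoes cdf

# Normal with mean 10 and standard deviation 2, evaluated one
# standard deviation above the mean; same as norm.cdf(1.0).
shifted = norm.cdf(12.0, loc=10.0, scale=2.0)

print(cdf_at_mean, cdf_at_196, round_trip, shifted, sep='\n')
```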

### A bit more complicated: Interpolation

In [11]:
from scipy.interpolate import interp1d

# Testing that interpolation works on random data
def test_interp(length_):
  arr = np.random.randn(length_)
  arr.sort()
  arr_cdf = norm.cdf(arr)
  # The interpolants below are only evaluated at existing sample indices
  # (every 2nd and every 7th point), where linear interpolation reproduces
  # the data exactly, so the reported error should be 0.

  # create interpolation functions on the data and its cdf
  f = interp1d(np.arange(len(arr)), arr)
  f_cdf = interp1d(np.arange(len(arr_cdf)), arr_cdf)

  # interpolate array and cdf
  arr_interp = f(np.arange(0, len(arr), 2))
  cdf_interp = f_cdf(np.arange(0, len(arr_cdf), 2))

  # Repeat with a prime stride (7) to make sure the stride-2 result isn't a fluke
  arr_interp2 = f(np.arange(0, len(arr), 7))
  cdf_interp2 = f_cdf(np.arange(0, len(arr_cdf), 7))

  # real cdf of interpolated data
  cdf_real = norm.cdf(arr_interp)

  # real cdf of (prime) interpolated data
  cdf_real2 = norm.cdf(arr_interp2)

  # error
  error = np.sum(cdf_real - cdf_interp)
  error2 = np.sum(cdf_real2 - cdf_interp2)
  print("Interpolation of CDF len {}, error: {} (prime {})".format(length_, error, error2))

for i in range(10):
  test_interp(5**(i+1)*13//12) # (pseudo) randomly increasing in magnitude

Interpolation of CDF len 5, error: 0.0 (prime 0.0)
Interpolation of CDF len 27, error: 0.0 (prime 0.0)
Interpolation of CDF len 135, error: 0.0 (prime 0.0)
Interpolation of CDF len 677, error: 0.0 (prime 0.0)
Interpolation of CDF len 3385, error: 0.0 (prime 0.0)
Interpolation of CDF len 16927, error: 0.0 (prime 0.0)
Interpolation of CDF len 84635, error: 0.0 (prime 0.0)
Interpolation of CDF len 423177, error: 0.0 (prime 0.0)
Interpolation of CDF len 2115885, error: 0.0 (prime 0.0)
Interpolation of CDF len 10579427, error: 0.0 (prime 0.0)
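
The errors above are exactly zero because the interpolants are only evaluated at points that were already in the data. A sketch of a more telling test evaluates between the sample points, where linear interpolation of the (curved) normal CDF is no longer exact. The grid, the midpoint evaluation, and the cubic comparison are all choices made for this illustration.

```python
import numpy as np
from scipy.interpolate import interp1d
from scipy.stats import norm

x_nodes = np.linspace(-3.0, 3.0, 13)  # coarse grid with spacing 0.5
y_nodes = norm.cdf(x_nodes)

f_linear = interp1d(x_nodes, y_nodes)  # default kind='linear'
f_cubic = interp1d(x_nodes, y_nodes, kind='cubic')

x_mid = (x_nodes[:-1] + x_nodes[1:]) / 2  # midpoints between the nodes

err_nodes = np.max(np.abs(f_linear(x_nodes) - y_nodes))         # at the nodes
err_linear = np.max(np.abs(f_linear(x_mid) - norm.cdf(x_mid)))  # between nodes
err_cubic = np.max(np.abs(f_cubic(x_mid) - norm.cdf(x_mid)))    # between nodes

print("Max error at nodes: {}".format(err_nodes),
      "Max error at midpoints (linear): {}".format(err_linear),
      "Max error at midpoints (cubic): {}".format(err_cubic), sep='\n')
```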


## 5. Additional Examples, Tutorials and Extended Learning

Depending on your application, scipy may provide the tools you need.

Check out the scipy reference [here](https://docs.scipy.org/doc/scipy/reference/index.html), and some useful tutorials [here](https://docs.scipy.org/doc/scipy/reference/tutorial/index.html) and [here](https://www.w3schools.com/python/scipy_intro.asp).

[Next time,](https://www.youtube.com/channel/UCvVAxOBEAMgu7WwTcFptt-w?sub_confirmation=1) we're going to talk about plotting using Matplotlib and Plotly, two powerful Python plotting libraries that provide distinctly different capabilities.