# STA 141B Data & Web Technologies for Data Analysis

### Lecture 5, 10/15/24, NumPy


### Announcements

- Homework 2 is due this Sunday, 11:59 PM. 
- Midterm next week on Thursday, Oct 17. Sample exam is online. 

### Last week's topics

- Basics of Python
- Memory handling in Python
    - Reference semantics
    - Interning

### Today's topics

 - Modules and Packages
 - NumPy 

### Modules and Packages 

A module is a text file that contains Python code, usually a `.py` file. Any Python file is a module as well, its name being the file's base name without the `.py` extension. A package is a collection of Python modules. 

The closest R equivalents are scripts and packages. Python's `import` command lets us load code from a module to use in our script or notebook. `import` is like a combination of R's `source` and `library` functions.

Python provides many built-in modules for common tasks (see the [list](https://docs.python.org/3/library/index.html)). Packages provide even more modules.

We can access the functions and constants of a module via `.`: 

In [None]:
pi # NameError: nothing called pi is defined yet

In [None]:
import math
math.pi

You can give imported modules an alias to cut down on typing: 

In [None]:
import math as m
m.pi

We can import a single name (a function or a constant) from a module or package by running

In [None]:
from math import pi
pi

Of course, other `math` functions are not imported this way: 

In [None]:
ceil(4.3) # NameError: only pi was imported by name

In [None]:
m.ceil(4.3)

Importing only the required functions keeps the namespace clean. We can import all names from a module by running

In [None]:
from math import * # this is called wildcard import

In [None]:
ceil(4.3)

Use this with care, in particular if the namespaces of some modules collide! 
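A minimal sketch of such a collision, using explicit imports instead of wildcards so that the effect is easy to see. The variable names `first` and `second` are made up for this illustration. Both `math` and `numpy` define a function called `sqrt`, and whichever import runs last silently wins:

```python
import math
import numpy as np

from math import sqrt    # the name sqrt now refers to math.sqrt
first = sqrt

from numpy import sqrt   # silently rebinds sqrt to numpy.sqrt
second = sqrt

print(first is math.sqrt)   # True
print(second is np.sqrt)    # True
print(sqrt is math.sqrt)    # False: the later import replaced the earlier one
```

With wildcard imports the same rebinding happens for every shared name at once, which is why aliases such as `np` are usually the safer choice.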

PEP 8, the Python style guide, allows importing several names from one module on a single line. This is an exception to its rule that separate imports go on separate lines.

In [None]:
from math import pi, ceil

Which of the built-in modules are important?

Module      | Description
----------- | -----------
sys         | info about Python (version, etc)
pdb         | Python debugger
pathlib     | tools for file paths
collections | additional data structures
string      | string processing
re          | regular expressions
datetime    | date processing
urllib.parse | tools for URLs
itertools   | tools for iterators
functools   | tools for functions

Python's built-in `math` and `statistics` modules are missing features we need for serious scientific computing, so we use the "SciPy Stack" instead.

The SciPy Stack is a collection of packages for scientific computing (marked with a `*` below). Most scientists working in Python use the SciPy Stack. The 3 most important packages in the stack are:

Package      | Description
------------ | -----------
numpy\*      | arrays, matrices, math/stat functions
scipy\*      | additional math/stat functions
pandas\*     | data frames

There are also several packages available for creating static plots.

Package      | Description
------------ | -----------
matplotlib\* | visualizations
seaborn      | "statistical" visualizations
plotnine     | ggplot2 for Python

Finally, there are many other packages we may use for specific statistical tasks. Some of these are:

Package      | Description
------------ | -----------
requests     | web (HTTP) requests
lxml         | web page parsing (XML & HTML)
beautifulsoup | web page parsing (HTML)
nltk         | natural language processing
spacy        | natural language processing
textblob     | natural language processing
statsmodels  | classical statistical models
scikit-learn | machine learning models
pillow       | image processing
scikit-image | image processing
opencv       | image processing

### NumPy
NumPy is a Python package that provides tools for numerical computing (the name stands for "Numerical Python"). If you use Anaconda, NumPy is already installed.

NumPy is documented [here][link4].

[link4]: https://numpy.org/doc/stable/

In [None]:
import numpy as np # standard naming convention for numpy

NumPy's core feature is the n-dimensional array, an object of type <kbd>numpy.ndarray</kbd>. NumPy arrays are the basis for almost all of Python's scientific computing packages. They are the Python equivalent of R's built-in vectors.

NumPy arrays use reference semantics, just like <kbd>list</kbd> objects. 
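A small sketch of what reference semantics means for arrays. The names `a`, `b` and `c` are chosen only for this illustration. Assignment creates a second name for the same array, while `.copy()` creates an independent one:

```python
import numpy as np

a = np.array([1, 2, 3])
b = a            # b is another name for the same array, not a copy
b[0] = 99        # modifying b therefore also changes a
print(a.tolist())        # [99, 2, 3]

c = a.copy()     # an explicit, independent copy
c[1] = -1        # a is unaffected by changes to c
print(a.tolist())        # [99, 2, 3]
print(b is a, c is a)    # True False
```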

#### Creating NumPy Arrays

You can create NumPy arrays from lists:

In [None]:
a = np.array([1,2,3])
a

In [None]:
type(a)

A NumPy array is a grid of values, all of the same type, indexed by a tuple of nonnegative integers. A list is Python's built-in counterpart of an array, but it is resizable and can contain elements of different types.

In [None]:
np.array([1, 2.0, 'horse']) # casts all entries to strings

In [None]:
np.array([1,2.0,3]) # recasts all entries to float

In [None]:
x = np.array([1,2,3]) 
y = np.array([4.1, 5.2, 6.3])
x + y

In [None]:
z = [2, 3, 4] #list!
x + z # list + array = array

In [None]:
x + 4

In [None]:
x + np.array([2, 3, 4, 5]) # ValueError: shapes (3,) and (4,) cannot be broadcast together

You can create multidimensional arrays, like matrices, from nested lists.

In [None]:
m = np.array([[1, 2, 3], [4, 5, 6]])
m

In [None]:
type(m)

In [None]:
np.shape(m)

In [None]:
np.size(m)

`numpy.matrix` is a subclass of `numpy.ndarray`. It inherits the latter's attributes and is restricted to two dimensions. There is no N-dimensional matrix in NumPy, only the N-dimensional array: see [this](https://numpy.org/doc/stable/reference/arrays.ndarray.html). Note that the NumPy documentation no longer recommends `numpy.matrix` and suggests regular arrays instead.

In [None]:
m = np.matrix([[1, 2, 3], [4, 5, 6]])
m

In [None]:
type(m)

In [None]:
np.shape(m)

NumPy also provides several helper functions to create arrays. See the [documentation](https://numpy.org/doc/stable/user/basics.creation.html) for a full list.

In [None]:
np.arange(0, 10, 2)

In [None]:
np.array(range(0, 10, 2))

In [None]:
np.ones(10)

In [None]:
np.zeros(10)

#### Inspecting Arrays

The array attributes `.shape` and `.size` contain information about the structure of the array. A plethora of array methods are also provided. 

In [None]:
x

In [None]:
np.size(x)

In [None]:
x.shape # returns a tuple! 

In [None]:
m.shape

In [None]:
x.size # like R's length()

In [None]:
m.size

In [None]:
len(x) # same result for 1-D arrays only; for a 2-D array, len() gives the number of rows

In [None]:
x.sum()

In [None]:
sum(x) # built-in sum gives the same result, but is slower on large arrays

The array attribute `.dtype` contains the data type of the array's elements.

In [None]:
x.dtype

In [None]:
x

In [None]:
y

In [None]:
y.dtype

In [None]:
(x + y).dtype

See [here][link5] or [here][link6] for a complete list of NumPy data types.

[link5]: https://docs.scipy.org/doc/numpy-1.15.0/user/basics.types.html
[link6]: https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html#NumPy-Standard-Data-Types 

A Python package typically contributes not just one class (e.g., `ndarray`), but also functions that work with it. This is why NumPy provides a variety of additional functions as well.

In [None]:
np.sin(x)

Keep in mind that this is not the same as the `math.sin` function we imported above. Indeed, `math.sin` raises an error when it is given an array:

In [None]:
sin(x) # or math.sin(x); TypeError: math.sin works on single numbers, not arrays

#### Indexing

You can subset NumPy arrays with indexes or Boolean arrays. Again, this is similar to R.

But be careful! Python uses `and` and `or` to combine single conditions, but to combine Boolean arrays element by element NumPy uses `&` and `|`, with each condition wrapped in parentheses!
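A short sketch of combining conditions on an array. The array `v` and the mask names are made up for this illustration:

```python
import numpy as np

v = np.array([1, 2, 3, 4, 5, 6])

# Parentheses are required: & and | bind more tightly than comparisons.
big_and_even = (v > 2) & (v % 2 == 0)
small_or_large = (v < 2) | (v > 5)

print(v[big_and_even].tolist())    # [4, 6]
print(v[small_or_large].tolist())  # [1, 6]
```

Writing `(v > 2) and (v % 2 == 0)` instead raises a `ValueError`, because `and` would need a single truth value for a whole array.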

In [None]:
x

In [None]:
x[0]

In [None]:
x[1]

In [None]:
x == x[0:]

In [None]:
m

In [None]:
m.shape

In [None]:
m[1,0]

In [None]:
x % 2 == 0

In [None]:
x[x % 2 == 0] # the Boolean mask keeps only the even entries

Note that although only one element is selected, a one-dimensional array is returned! 

In multidimensional arrays, separate the indexes for each dimension with commas. The bare slice `:` selects everything in one dimension, just as we saw in previous lectures. 

In [None]:
m

In [None]:
m[:, 0]

In [None]:
m[1, :]

In [None]:
m[:, 0:2]

In [None]:
m[0,:]

Matrix multiplication can be performed using `@`. 

In [None]:
m

In [None]:
(np.transpose(m) @ m ).shape

In [None]:
m @ np.transpose(np.matrix([[1, 2, 3]]))
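The cells above use a `numpy.matrix`. As a hedged sketch, the same operations on a plain two-dimensional array look as follows (the names `A`, `elementwise`, `gram` and `prod` are made up for this illustration). On plain arrays, `*` multiplies element by element and `@` is the matrix product:

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])

elementwise = A * A              # entry-by-entry product, shape (2, 3)
gram = A.T @ A                   # matrix product, shape (3, 3)
prod = A @ np.array([1, 2, 3])   # matrix-vector product

print(elementwise.shape)   # (2, 3)
print(gram.shape)          # (3, 3)
print(prod.tolist())       # [14, 32]
```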

So what is NumPy good for?

NumPy also provides functions for:
- Linear algebra (multiplication, transposition, decomposition, ...)
- Random number generation
- Elementary statistics
- Signal processing
- And more...

There isn't time to cover all of these in detail in lecture, but you can learn more from the documentation and references.
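As a brief taste, here is a minimal sketch touching three items from the list. The linear system, the seed and the sample size are arbitrary choices made for this illustration:

```python
import numpy as np

# Linear algebra: solve the system 2x + y = 3, x + 3y = 5
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
sol = np.linalg.solve(A, b)
print(np.allclose(A @ sol, b))   # True

# Random number generation: 1000 standard normal draws
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)

# Elementary statistics: sample mean and standard deviation
print(sample.mean(), sample.std())
```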

## Using NumPy for Monte Carlo Integration
### Example - approximate $\pi$ using Monte Carlo method

Consider a circle with radius 1 circumscribed by a square with side length 2.

The area of the circle is $\pi$ and the area of the square is 4, so for a uniform distribution on the square, the probability that a point falls in the circle is $\pi / 4$.

We can estimate this probability by simulation and multiply the estimate by 4 to approximate $\pi$.
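Written out as a formula, the idea is the following. The notation $\hat{\pi}_n$ and $(X_i, Y_i)$ is introduced here for illustration and does not appear elsewhere in the lecture. For $n$ independent uniform points $(X_i, Y_i)$, each indicator is a Bernoulli variable with success probability $\pi / 4$, so

$$
\hat{\pi}_n = \frac{4}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i^2 + Y_i^2 \le 1\},
\qquad
\operatorname{SE}(\hat{\pi}_n) = \sqrt{\frac{16 \cdot \frac{\pi}{4}\left(1 - \frac{\pi}{4}\right)}{n}} = \sqrt{\frac{\pi (4 - \pi)}{n}} \approx \frac{1.64}{\sqrt{n}}.
$$

The error therefore shrinks like $1 / \sqrt{n}$: roughly 100 times more points are needed for each additional correct digit.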

In [None]:
import numpy as np

def approx_pi(n): 
    # Sample n points (x, y) uniformly on the unit square [0, 1) x [0, 1).
    # By symmetry, this quarter of the square gives the same ratio pi / 4.
    u = np.random.rand(n, 2) # array of shape (n, 2)

    # Check whether each sampled point lies in the (quarter) circle
    in_circle = u[:,0]**2 + u[:,1]**2 <= 1

    # The mean of the Boolean array is the fraction of points in the circle
    pi_hat = 4 * in_circle.mean()
    return pi_hat

In [None]:
approx_pi(100_000)

In [None]:
np.pi

In [None]:
math.pi

We have used the package `numpy.random`, a subpackage of NumPy. 

In [None]:
help(np.random)

There are other modules that produce random numbers, like Python's built-in `random`. 

In [None]:
import random
random.seed(1) # seeds only the built-in generator; NumPy's generator is separate, use np.random.seed(1) for it
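A minimal sketch of seeding NumPy for reproducible draws. The seed value 1 and the variable names are arbitrary choices for this illustration. Resetting the seed replays the same sequence of numbers:

```python
import numpy as np

np.random.seed(1)             # seed NumPy's global generator
first = np.random.rand(3)

np.random.seed(1)             # same seed again
second = np.random.rand(3)

print((first == second).all())   # True: the draws are identical

# Alternative: a separate generator object with its own seed
rng = np.random.default_rng(1)
draws = rng.random(3)            # uniform draws on [0, 1)
```

Seeding before calling a simulation function such as `approx_pi` makes its result repeatable from run to run.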

Let's see how fast our approximation converges. 

In [None]:
log10_nsims = np.linspace(start=0.5, stop=6, num=100)

In [None]:
nsims = (10**log10_nsims).astype(int)

In [None]:
nsims

In [None]:
pi_estimates = [approx_pi(n) for n in nsims] # avoid reusing the name pi imported from math
pi_estimates

Let's visualise this numerical convergence. 

In [None]:
import matplotlib.pyplot as plt
plt.plot(log10_nsims, pi_estimates)
plt.axhline(y=np.pi, color='r', linestyle='-') # add red line at the true value
plt.xlabel(r"$\log_{10}($nsims$)$") # raw string keeps the backslash; label uses LaTeX
plt.show() 