# <img style="float: left; padding-right: 100px; width: 300px" src="images/logo.png">ParrotAI IPT 2019:


## Python Numerical Stack
**Authors:** Faustine


---

## 2 Numpy Basics

NumPy is one of the fundamental packages for scientific computing in Python. It contains functionality for multidimensional arrays, high-level mathematical functions such as linear algebra operations and the Fourier transform, and pseudorandom number generators.
In scikit-learn, the NumPy array is the fundamental data structure. scikit-learn takes in data in the form of NumPy arrays. Any data you’re using will have to be converted to a NumPy array. 
The core functionality of NumPy is the ndarray class, a multidimensional (n-dimensional) array. All elements of the array must be of the same type. 

It contains among other things:

* a powerful N-dimensional array/vector/matrix object
* sophisticated (broadcasting) functions
* function implementation in C/Fortran assuring good performance if vectorized
* tools for integrating C/C++ and Fortran code
* useful linear algebra, Fourier transform, and random number capabilities

This style of programming is also known as *array-oriented computing*. The recommended convention for importing numpy is:

In [1]:
import numpy as np

### 2.1 Creating numpy arrays

There are a number of ways to initialize new numpy arrays, for example from

* a Python list or tuple, or
* functions that are dedicated to generating numpy arrays, such as arange, linspace, empty, zeros, etc.

In [4]:
client_age = [80, 60,18, 32]
a = np.array(client_age)
a

array([80, 60, 18, 32])

In [5]:
m = np.array([[80,1.2],[60,2.5], [18,1], [32,0.5]])
m

array([[80. ,  1.2],
       [60. ,  2.5],
       [18. ,  1. ],
       [32. ,  0.5]])
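
Note that the integers in `m` are displayed as floats: because all elements of an array must share one type, NumPy converts mixed input to a common type. A minimal sketch of this behaviour follows; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

ints = np.array([80, 60, 18, 32])          # all integers -> integer dtype
mixed = np.array([[80, 1.2], [60, 2.5]])   # ints and floats -> upcast to float64

print(ints.dtype)    # an integer dtype (int32 or int64, depending on the platform)
print(mixed.dtype)   # float64

# The dtype can also be requested explicitly when the array is created
as_float = np.array([80, 60, 18, 32], dtype=float)
print(as_float.dtype)  # float64
```

Passing `dtype` explicitly is a reasonable habit when the element type matters for later calculations.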

For larger arrays it is impractical to initialize the data manually using explicit Python lists.
Instead, we can use one of the many numpy functions that generate arrays of different forms.

Some of the more common are:

* np.arange;
* np.linspace;
* np.logspace;
* np.diag;
* np.zeros;
* np.ones;
* np.empty;


In [8]:
# Create an array filled with a linear sequence
# Starting at 0 and stopping before 10 (the stop value is excluded)
# (this is similar to the built-in range() function)

np.arange(0,10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [11]:
# Create an array of five values evenly spaced between -1 and 1
np.linspace(-1,1,5)

array([-1. , -0.5,  0. ,  0.5,  1. ])

In [13]:
# Create a length-10 floating-point array filled with zeros
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [14]:
#Create a 3x2 floating-point array filled with 1s
np.ones((3,2))

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

In [26]:
# Create an uninitialized 2x3 array
# The values will be whatever happens to already be at that
# memory location
np.empty((2,3))

array([[1., 1., 1.],
       [1., 1., 1.]])
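
The list of creation functions above also names `np.logspace` and `np.diag`, and `np.arange` accepts an optional step. A short sketch of these calls, with argument values chosen only for illustration:

```python
import numpy as np

# arange with an explicit step: 0, 2, 4, 6, 8 (the stop value 10 is excluded)
evens = np.arange(0, 10, 2)

# logspace: 4 values spaced evenly on a log scale from 10**0 to 10**3
powers = np.logspace(0, 3, 4)

# diag: a square matrix with the given values on the main diagonal
d = np.diag([1, 2, 3])

# zeros with an explicit integer dtype (the default dtype is float64)
int_zeros = np.zeros(5, dtype=int)

print(evens)
print(powers)
print(d)
print(int_zeros)
```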

## Basic array manipulations

#### 1. Attributes of arrays
Determining the size, shape, memory consumption, and data types of arrays

#### 2. Indexing of arrays
Getting and setting the value of individual array elements

#### 3. Slicing of arrays
Getting and setting smaller subarrays within a larger array

#### 4. Reshaping of arrays
Changing the shape of a given array

#### 5. Joining and splitting of arrays
Combining multiple arrays into one, and splitting one array into many

## Attributes of arrays

In [27]:
m

array([[80. ,  1.2],
       [60. ,  2.5],
       [18. ,  1. ],
       [32. ,  0.5]])

In [79]:
m.shape

(4, 2)

In [80]:
m.size   #number of elements

8

In [81]:
m.flatten()   # returns a flattened copy; m itself is unchanged
m

array([[80. ,  1.2],
       [60. ,  2.5],
       [18. ,  1. ],
       [32. ,  0.5]])

In [33]:
a=a.reshape(4,1)
a.shape
a

array([[80],
       [60],
       [18],
       [32]])

In [83]:
x = np.random.rand(10, 2)
print(x)

[[0.36288856 0.00131033]
 [0.85105225 0.95705178]
 [0.10608444 0.08564436]
 [0.44077163 0.17949444]
 [0.89745351 0.81482465]
 [0.8631029  0.84322346]
 [0.19491716 0.53840316]
 [0.33841919 0.89189023]
 [0.94124628 0.99903544]
 [0.30747003 0.13518552]]


In [84]:
x.flatten()

array([0.36288856, 0.00131033, 0.85105225, 0.95705178, 0.10608444,
       0.08564436, 0.44077163, 0.17949444, 0.89745351, 0.81482465,
       0.8631029 , 0.84322346, 0.19491716, 0.53840316, 0.33841919,
       0.89189023, 0.94124628, 0.99903544, 0.30747003, 0.13518552])
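
The remaining attributes mentioned earlier (number of dimensions, data type, and memory consumption) can be inspected in the same way. A minimal sketch, rebuilding the array `m` from above so that the block is self-contained:

```python
import numpy as np

m = np.array([[80, 1.2], [60, 2.5], [18, 1], [32, 0.5]])

print(m.ndim)      # number of dimensions: 2
print(m.dtype)     # element type: float64
print(m.itemsize)  # bytes per element: 8
print(m.nbytes)    # total bytes: size * itemsize = 8 * 8 = 64

# flatten() returns a new 1-D array and leaves m unchanged
flat = m.flatten()
print(flat.shape)  # (8,)
print(m.shape)     # (4, 2)
```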

<div class="alert alert-success">
    <b>Activity 1</b>: Create a vector with values ranging from 10 to 49 with steps of 1
</div>

## Array Indexing: Accessing Single Elements

In [34]:
a 

array([[80],
       [60],
       [18],
       [32]])

In [35]:
# To index from the end of the array, you can use negative indices:
a[-1]

array([32])

In [77]:
a.ndim    # number of dimensions

2

In [78]:
a.dtype    # data type

dtype('int32')

#### Random numbers and seeds

In [30]:
# uniform random numbers in [0, 1) (1 itself is excluded)
np.random.rand(5,5)

array([[0.22451589, 0.03257123, 0.00908166, 0.17760025, 0.60477903],
       [0.67889845, 0.43981975, 0.83466912, 0.51166813, 0.14502045],
       [0.27749399, 0.38665929, 0.67226104, 0.26861481, 0.42170824],
       [0.19357725, 0.70896973, 0.43556208, 0.53488535, 0.70520624],
       [0.56411024, 0.69045547, 0.3285683 , 0.01417483, 0.92915449]])

In [90]:
# standard normal distributed random numbers
np.random.randn(5,5)

array([[-1.09823952, -0.21677049, -1.03206435, -0.16838309,  0.5569389 ],
       [ 0.81284184, -0.12711209, -0.15227752,  0.07095881,  0.01995426],
       [-1.12027011, -1.72720398,  0.88876051, -0.43780398, -0.64161034],
       [ 0.42746404,  0.83052752,  1.08599978, -0.13763766, -1.40696128],
       [ 1.15934812, -0.99142001,  0.1073511 , -0.54479126,  0.61791029]])

#### Random seed

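Setting the seed fixes the starting state of the random number generator, so the same sequence of random numbers is produced on every run. Use it when you want repeatable (reproducible) results.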

In [88]:
np.random.seed(77)
x=np.random.rand(8,2)
print(x)

[[0.91910903 0.6421956 ]
 [0.75371223 0.13931457]
 [0.08731955 0.78800206]
 [0.32615094 0.54106782]
 [0.24023518 0.54542293]
 [0.4005545  0.71519189]
 [0.83667994 0.58848114]
 [0.29615456 0.28101769]]
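
To see the effect of the seed, the sketch below draws arrays after resetting the generator and compares them. The seed value 77 follows the cell above; the second seed and the variable names are illustrative.

```python
import numpy as np

np.random.seed(77)
first = np.random.rand(3, 2)

np.random.seed(77)            # reset the generator to the same state
second = np.random.rand(3, 2)

np.random.seed(78)            # a different seed gives a different sequence
third = np.random.rand(3, 2)

print(np.array_equal(first, second))  # True
print(np.array_equal(first, third))   # False
```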


#### vstack and hstack

In [37]:
x = np.ones((5, 2))
print(x)

[[1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]]


In [38]:
y = np.zeros((5, 2))
print(y)

[[0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]]


In [39]:
z = np.hstack((x,y))
print(z)

[[1. 1. 0. 0.]
 [1. 1. 0. 0.]
 [1. 1. 0. 0.]
 [1. 1. 0. 0.]
 [1. 1. 0. 0.]]


In [40]:
z = np.vstack((x,y))
print(z)

[[1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]]


In [None]:
client_price = np.array([1,0.1,4,2])

In [None]:
client_price.reshape(-1,1).shape

In [None]:
# Join ages and prices as two columns (axis=1 concatenates column-wise)
np.concatenate((a.reshape(-1,1), client_price.reshape(-1,1)), axis=1)
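
For 2D arrays, `np.vstack` and `np.hstack` behave like `np.concatenate` along the first and second axis, and `np.split` reverses the operation. A small sketch with illustrative arrays:

```python
import numpy as np

x = np.ones((5, 2))
y = np.zeros((5, 2))

# axis=0 stacks rows (like vstack), axis=1 stacks columns (like hstack)
rows = np.concatenate((x, y), axis=0)
cols = np.concatenate((x, y), axis=1)
print(rows.shape)  # (10, 2)
print(cols.shape)  # (5, 4)

print(np.array_equal(rows, np.vstack((x, y))))  # True
print(np.array_equal(cols, np.hstack((x, y))))  # True

# np.split cuts an array into equal parts along an axis
top, bottom = np.split(rows, 2, axis=0)
print(np.array_equal(top, x), np.array_equal(bottom, y))  # True True
```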

### Array Slicing: Accessing Subarrays (Indexing and slicing)
Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the slice notation, marked by the colon (`:`) character.

To access a slice of an array `x`, use this:

`x[start:stop:step]`

If any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1.
We'll take a look at accessing subarrays in one dimension and in multiple dimensions.
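
Before turning to random data, here is a minimal sketch of how `start`, `stop`, and `step` interact, using a small array whose values equal their indices (chosen for illustration so the results are easy to read):

```python
import numpy as np

x = np.arange(10)   # [0 1 2 3 4 5 6 7 8 9]

print(x[2:8])     # start and stop given, step defaults to 1 -> 2, 3, 4, 5, 6, 7
print(x[2:8:3])   # every third element from index 2, stopping before 8 -> 2, 5
print(x[:4])      # start defaults to 0 -> 0, 1, 2, 3
print(x[6:])      # stop defaults to the end -> 6, 7, 8, 9
print(x[-3:])     # negative indices count from the end -> 7, 8, 9
print(x[::-1])    # a negative step walks backwards -> 9, 8, ..., 0
```

Note that `stop` is always exclusive, and that with a negative step the defaults for `start` and `stop` are swapped.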

In [91]:
data = np.random.randint(25,40, size=10)
print(data)

[34 37 26 28 28 32 29 34 33 36]


In [92]:
data[7:]  # the last three elements

array([34, 33, 36])

In [47]:
# first five elements
print(data[:5])

[34 37 26 28 28]


In [None]:
# print the elements at indices 3 through 6 (index 7 is excluded)
print(data[3:7])

In [48]:
# step is set to 2
print(data[::2])

[34 26 28 29 33]


A multidimensional array behaves like a dataframe or matrix (i.e. it has rows and columns). Consider the following 2D array.

In [66]:
data = np.random.randint(25,37, size=(10,3))
print(data)

[[29 25 32]
 [34 36 35]
 [29 25 34]
 [34 30 34]
 [31 27 32]
 [28 31 36]
 [26 27 36]
 [27 29 28]
 [26 30 32]
 [25 32 25]]


In [67]:
data[:,2]

array([32, 35, 34, 34, 32, 36, 36, 28, 32, 25])

In [76]:
data[2:5,[0,2]]

array([[29, 34],
       [34, 34],
       [31, 32]])

#### Class Activity:
Write code to display the 3rd to 5th rows with all the columns.
   
 

In [None]:
mask = data>30
data[mask]

In [None]:
np.sqrt(data)

In [None]:
# View the first column of the array
data[:,0]

In [None]:
# View the first row of the array
data[0,]

In [None]:
# View the first two rows
data[:2,]

In [None]:
# View the first element (row 0, column 0)
data[0,0]

#### Fancy indexing

In [52]:
# View all elements that are less than 30
mask = data<30
data[mask]

array([28, 29])

In [53]:
if (data > 30).any():
    print("at least one element in data is larger than 30")
else:
    print("no element in data is larger than 30")

at least one element in data is larger than 30
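
Boolean masks support more than selection. The sketch below, on a small hand-written array (not the random `data` above), shows `any`, `all`, counting matches, combining conditions, and assigning through a mask:

```python
import numpy as np

data = np.array([[29, 25, 32],
                 [34, 36, 35],
                 [29, 25, 34]])

mask = data > 30
print(mask.any())    # True: at least one element is larger than 30
print(mask.all())    # False: not every element is larger than 30
print(mask.sum())    # 5: True counts as 1, so this counts the matches

# Masks can be combined with & (and) and | (or); the parentheses are required
in_range = data[(data >= 29) & (data <= 34)]
print(in_range)      # the values 29, 32, 34, 29, 34

# A mask can also be used to assign values in place
capped = data.copy()
capped[capped > 34] = 34
print(capped.max())  # 34
```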


<div class="exercise"><b>Exercise</b></div>
* Create a two-dimensional array of size $3\times 5$ and do the following:
  * Print out the array
  * Print out the shape of the array
  * Create two slices of the array:
    1. The first slice should be the last row and the third through last column
    2. The second slice should be rows $1-3$ and columns $3-5$
  * Square each element in the array and print the result

### Save and load numpy data to/from file

In [54]:
np.save("data/sensor_data.npy",data)

In [55]:
sensor_data = np.load("data/sensor_data.npy")
print(sensor_data)

[28 38 38 31 29 34 30 38 32 38]


In [61]:
# Load from a text file
sms = np.loadtxt("data/sms.txt")
sms

array([13., 24.,  8., 24.,  7., 35., 14., 11., 15., 11., 22., 22., 11.,
       57., 11., 19., 29.,  6., 19., 12., 22., 12., 18., 72., 32.,  9.,
        7., 13., 19., 23., 27., 20.,  6., 17., 13., 10., 14.,  6., 16.,
       15.,  7.,  2., 15., 15., 19., 70., 49.,  7., 53., 22., 21., 31.,
       19., 11., 18., 20., 12., 35., 17., 23., 17.,  4.,  2., 31., 30.,
       13., 27.,  0., 39., 37.,  5., 14., 13., 22.])
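
For a self-contained round trip, the sketch below writes an array in both formats and reads it back. It uses a temporary directory rather than the `data/` folder from the cells above, and the file names are hypothetical.

```python
import os
import tempfile

import numpy as np

values = np.array([13., 24., 8., 24., 7.])

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "values.npy")   # binary NumPy format
    txt_path = os.path.join(tmp, "values.txt")   # plain text, one value per line

    np.save(npy_path, values)
    np.savetxt(txt_path, values)

    from_npy = np.load(npy_path)
    from_txt = np.loadtxt(txt_path)

print(np.array_equal(values, from_npy))  # True
print(np.allclose(values, from_txt))     # True
```

The `.npy` format preserves the dtype and shape exactly, while the text format is easier to inspect or share with other tools.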

### Calculations

Often it is useful to store datasets in Numpy arrays. Numpy provides a number of functions to calculate statistics of datasets in arrays. 

In [57]:
#mean
sms.mean()

19.743243243243242

In [58]:
#std
sms.std()

14.045352035662814

In [59]:
#min
sms.min()

0.0

In [60]:
#max
sms.max()

72.0
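
For 2D arrays the same methods accept an `axis` argument, which selects whether the statistic is computed per column or per row. A minimal sketch on a small hypothetical table (the values are made up for illustration):

```python
import numpy as np

# Hypothetical 2D data: 3 rows (observations) x 2 columns (variables)
table = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])

print(table.mean())         # mean of all six values: 11.0
print(table.mean(axis=0))   # mean of each column: 2.0 and 20.0
print(table.mean(axis=1))   # mean of each row: 5.5, 11.0 and 16.5
print(table.max(axis=0))    # largest value in each column: 3.0 and 30.0
print(np.median(table))     # median of all values: 6.5
```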

### Numpy calculations are element-wise

In [62]:
x = np.arange(1,10)
print(x)

[1 2 3 4 5 6 7 8 9]


In [63]:
print(x+2)

[ 3  4  5  6  7  8  9 10 11]


In [64]:
#print(x**2)
np.square(x)

array([ 1,  4,  9, 16, 25, 36, 49, 64, 81], dtype=int32)

In [65]:
np.log(x)

array([0.        , 0.69314718, 1.09861229, 1.38629436, 1.60943791,
       1.79175947, 1.94591015, 2.07944154, 2.19722458])
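
The same element-wise rule applies when both operands are arrays, and broadcasting extends it to arrays of different but compatible shapes. A short sketch with illustrative values:

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([10, 20, 30])

print(u + v)    # element-wise sum: 11, 22, 33
print(u * v)    # element-wise product (not a dot product): 10, 40, 90
print(u @ v)    # dot product, for comparison: 140

# Broadcasting: the 1D array is applied to every row of the 2D array
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
print(grid * u)  # rows become 1, 4, 9 and 4, 10, 18
```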

### `Numpy` Arrays vs. `Python` Lists?

1. Why the need for `numpy` arrays?  Can't we just use `Python` lists?
2. Iterating over `numpy` arrays is slow. Slicing is faster.

One important—and extremely useful—thing to know about array slices is that they return views rather than copies of the array data. This is one area in which NumPy array slicing differs from Python list slicing.

This default behavior is actually quite useful: it means that when we work with large datasets, we can access and process pieces of these datasets without the need to copy the underlying data buffer.
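
The difference between a view and a copy is easiest to see by modifying a slice. A minimal sketch comparing a numpy array with a Python list (the variable names are illustrative):

```python
import numpy as np

arr = np.arange(10)
lst = list(range(10))

view = arr[:3]       # a view onto the first three elements of arr
view[0] = 99         # writes through to the original array
print(arr[0])        # 99

sub = lst[:3]        # list slicing makes a copy
sub[0] = 99
print(lst[0])        # 0: the original list is unchanged

safe = arr[:3].copy()   # request an explicit copy when independence is needed
safe[1] = -1
print(arr[1])           # 1
print(np.shares_memory(arr, view), np.shares_memory(arr, safe))  # True False
```

When a slice will be modified and the original must stay intact, calling `.copy()` avoids accidental changes.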

`Python` lists may contain items of different types. This flexibility comes at a price: `Python` lists store *pointers* to memory locations.  On the other hand, `numpy` arrays are typed, where the default type is floating point.  Because of this, the system knows how much memory to allocate, and if you ask for an array of size $100$, it will allocate one hundred contiguous spots in memory, where the size of each spot is based on the type.  This makes access extremely fast.

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/array_vs_list.png" alt="Drawing" style="width: 500px;"/>

(image from Jake Vanderplas's Data Science Handbook)

Unfortunately, looping over an array slows things down. In general you should not access `numpy` array elements by iteration.  This is because of type conversion.  `Numpy` stores integers and floating points in `C`-language format.  When you operate on array elements through iteration, `Python` needs to convert that element to a `Python` `int` or `float`, which is a more complex beast (a `struct` in `C` jargon).  This has a cost.

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/cint_vs_pyint.png" alt="Drawing" style="width: 500px;"/>

(image from Jake Vanderplas's Data Science Handbook)

If you want to know more, we suggest that you read:
- [Jake Vanderplas's Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/). 
- [Wes McKinney's Python for Data Analysis](https://hollis.harvard.edu/primo-explore/fulldisplay?docid=01HVD_ALMA512247401160003941&context=L&vid=HVD2&lang=en_US&search_scope=everything&adaptor=Local%20Search%20Engine&tab=everything&query=any,contains,Wes%20McKinney%27s%20Python%20for%20Data%20Analysis&sortby=rank&offset=0) (HOLLIS)<br>
You will find them both incredible resources for this class.

Why is slicing faster? The reason is technical: slicing provides a *view* onto the memory occupied by a `numpy` array, instead of creating a new array. However, if you iterate over a slice, you are back to the slow element-by-element access.

By contrast, functions such as `np.dot` are implemented at the `C` level, do not do this type conversion, and access contiguous memory. If you want this kind of access in `Python`, use the `struct` module or `Cython`. Indeed, many fast algorithms in `numpy` and `pandas` are either implemented at the `C` level or employ `Cython`.
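
The cost of iteration can be measured directly. The sketch below compares a Python loop with the built-in `sum` method using the standard-library `timeit` module; the helper functions and the array size are assumptions made for illustration, and the timings will vary from machine to machine.

```python
import timeit

import numpy as np

values = np.arange(100_000, dtype=float)

def loop_sum(arr):
    """Sum by iterating in Python: each element is converted to a Python-level object."""
    total = 0.0
    for v in arr:
        total += v
    return total

def vectorized_sum(arr):
    """Sum inside NumPy's compiled code, without per-element conversion."""
    return arr.sum()

# Both approaches compute the same result
print(np.isclose(loop_sum(values), vectorized_sum(values)))

# Timings depend on the machine, so no expected numbers are shown here
print("loop:      ", timeit.timeit(lambda: loop_sum(values), number=5))
print("vectorized:", timeit.timeit(lambda: vectorized_sum(values), number=5))
```

On typical hardware the vectorized call should be considerably faster, which is the practical reason to prefer built-in numpy functions over explicit loops.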

# Concept Check:
Answer these questions and see the bottom of the lab to check your answers.

1. What is the major benefit of working in `numpy`?
2. Why is it vital to use built-in numpy functions whenever possible?
3. You need to find the running total of each element in a numpy array. For example, the input \[2,5,3,2\] should give the output \[2,7,10,12\]. Rather than writing your own code, think of at least two Google queries to find a relevant numpy function.
4. What `numpy` function does the job above?

## References

- [python4datascience-atc](https://github.com/pythontz/python4datascience-atc)
- [PythonDataScienceHandbook](https://github.com/jakevdp/PythonDataScienceHandbook)
- [DS-python-data-analysis](https://github.com/jorisvandenbossche/DS-python-data-analysis)