# AIDM7330 Basic Programming for Data Science

# Introduction to NumPy

NumPy stands for Numerical Python and it is the fundamental package for scientific computing with Python.

It is a package that lets you efficiently store and manipulate numerical **arrays**.

NumPy's core feature is the `ndarray`, a multi-dimensional array object. In NumPy, dimensions are called axes, and the number of axes is called the rank.
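
To make the terms "axes" and "rank" concrete, here is a minimal sketch. The variable names are illustrative only; `ndim` reports the number of axes and `shape` reports the length of each axis.

```python
import numpy as np

single_value = np.array(7)                   # 0 axes
vector = np.array([1, 2, 3])                 # 1 axis
matrix = np.array([[1, 2, 3], [4, 5, 6]])    # 2 axes

for arr in (single_value, vector, matrix):
    # ndim is the number of axes (the rank); shape is the length of each axis
    print(arr.ndim, arr.shape)
```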

In [1]:
# Install required packages using pip package manager in the current Jupyter kernel
import sys
!{sys.executable} -m pip install numpy



In [2]:
import numpy as np

In [3]:
np.__version__

'1.23.5'


## Creating a NumPy Array:
### 1. The simplest way: pass a Python list to `np.array`


In [4]:
# Create array from Python list
list1 = [1, 2, 3, 4]
data = np.array(list1)
print(list1)
print(data)
print(type(data))

[1, 2, 3, 4]
[1 2 3 4]
<class 'numpy.ndarray'>


In [5]:
# data = np.array(1,2,3,4, 5,6,7,8,9) # wrong: the values must be passed as one sequence
data = np.array([1,2,3,4,5,6,7,8,9]) # right: a single list
data

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [6]:
# Find out object type
type(data)

numpy.ndarray

In [7]:
# See data type that is stored in the array
data.dtype  # int64: 64-bit integers, which can store very large numbers

dtype('int64')

In [8]:
# The data type is fixed for the whole array. If we store a float in an
# int array, the float is truncated (down-cast) to an int
data[0] = 3.14159  # information is lost: 3.14159 -> 3
print(data)

[3 2 3 4 5 6 7 8 9]
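
The truncation above can be avoided by converting the array to a float type first. The sketch below uses illustrative variable names and assumes `astype`, which returns a new array and leaves the original unchanged.

```python
import numpy as np

ints = np.array([1, 2, 3])
ints[0] = 3.99               # stored as 3: the fractional part is dropped, not rounded
print(ints)

floats = ints.astype(float)  # astype returns a new float array; ints is unchanged
floats[0] = 3.99             # now the full value is kept
print(floats)
```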


In [9]:
# NumPy converts to most logical data type
list2 = [1.2, 2, 3, 4]
data2 = np.array(list2)
print(data2)
print(data2.dtype) # if any value is a float, all values are converted to floats

[1.2 2.  3.  4. ]
float64


In [10]:
# We can manually specify the datatype
list3 = [1, 2, 3]
data3 = np.array(list3, dtype=float) #manually specify data type
print(data3)
print(data3.dtype)

[1. 2. 3.]
float64


In [12]:
# The input can also be much longer, here a range of 100001 integers
list4 = range(100001)
data = np.array(list4)
data

array([     0,      1,      2, ...,  99998,  99999, 100000])

In [13]:
len(data) # to see the length of the full array

100001

In [14]:
# Show the documentation; the first parameter, `object`, is the data to convert
np.array?

More info on data types can be found here:
https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html

# Accessing elements: Slicing and indexing

In [15]:
# Similar to indexing and slicing Python lists:
print(data[:])
print(data[0:3])
print(data[3:])
print(data[::-2])

[     0      1      2 ...  99998  99999 100000]
[0 1 2]
[     3      4      5 ...  99998  99999 100000]
[100000  99998  99996 ...      4      2      0]


In [16]:
# more slicing
x = np.array(range(25))
print('x:', x)
print(x[5:15:2])   # every 2nd element from index 5 up to (not including) 15
print(x[15:5:-1])  # backwards from index 15 down to (not including) 5

x: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24]
[ 5  7  9 11 13]
[15 14 13 12 11 10  9  8  7  6]


In [17]:
print(data[::-1])  # [start : stop : step]; a step of -1 reverses the array

[100000  99999  99998 ...      2      1      0]
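
One difference from Python lists is worth noting here: slicing a list creates a new list, while a basic slice of a NumPy array is a view onto the same data. The following sketch (with illustrative variable names) shows the effect and how `copy()` avoids it.

```python
import numpy as np

py_list = [0, 1, 2, 3, 4]
list_slice = py_list[1:4]   # a new, independent list
list_slice[0] = 99
print(py_list)              # the original list is unchanged

arr = np.arange(5)
arr_slice = arr[1:4]        # a view onto arr's data
arr_slice[0] = 99
print(arr)                  # the original array is modified too

safe = arr[1:4].copy()      # an explicit copy is independent of arr
safe[0] = -1
print(arr)
```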


# Arrays are a lot faster than lists

In [18]:
# Element-wise arithmetic on arrays is vectorized, so it is much faster than a Python loop over a list

x = list(range(100000))
y = [i**2 for i in x]
print(y[0:5])

[0, 1, 4, 9, 16]


In [19]:
# Time the operation with the IPython %timeit magic command (-o returns the timing result as an object)
print('Time for Python lists:')
list_time = %timeit -o -n 20 [i**2 for i in x]

Time for Python lists:
37.1 ms ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)


In [21]:
z = np.array(x)
w = z**2  # squares each element of z
print(w[:5])

[ 0  1  4  9 16]


In [22]:
print('Time for NumPy arrays:')
np_time = %timeit -o -n 20 z**2

Time for NumPy arrays:
120 µs ± 51.9 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)


In [23]:
print('NumPy is ' + str(list_time.all_runs[0]//np_time.all_runs[0]) + ' times faster than lists at squaring 100 000 elements.')

NumPy is 181.0 times faster than lists at squaring 100 000 elements.
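
The `%timeit` magic only works inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library `timeit` module can be used. The repeat and loop counts below are arbitrary choices, and the measured ratio will differ from machine to machine.

```python
import timeit
import numpy as np

x = list(range(100000))
z = np.array(x)

# timeit.repeat returns one total time (for `number` executions) per repeat
list_runs = timeit.repeat(lambda: [i**2 for i in x], repeat=3, number=5)
np_runs = timeit.repeat(lambda: z**2, repeat=3, number=5)

# Compare the best (minimum) runs; the exact ratio depends on the machine
speedup = min(list_runs) / min(np_runs)
print(f'NumPy is roughly {speedup:.0f} times faster here.')
```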


# Join, add, concatenate

In [24]:
xn = [1, 2, 3, 4, 5]
yn = [6, 7, 8, 9, 10]

In [25]:
print(xn)
print(yn)

[1, 2, 3, 4, 5]
[6, 7, 8, 9, 10]


In [26]:
# if you need to join numpy arrays, try hstack, vstack, column_stack, or concatenate
print(np.hstack((xn, yn)))  # horizontal stack: one after the other

[ 1  2  3  4  5  6  7  8  9 10]


In [27]:
len(np.hstack((xn,yn)))

10

In [28]:
print(np.vstack((xn, yn)))  # vertical stack: each list becomes a row

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]


In [29]:
len(np.vstack((xn,yn)))

2

In [30]:
print(np.column_stack((xn, yn)))  # each list becomes a column

[[ 1  6]
 [ 2  7]
 [ 3  8]
 [ 4  9]
 [ 5 10]]


In [31]:
len(np.column_stack(  (xn,yn)  ))  # 5 "pairs"

5

In [32]:
print(np.concatenate((xn, yn), axis=0))  # 1-D inputs have only one dimension, axis 0

[ 1  2  3  4  5  6  7  8  9 10]
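
With two-dimensional inputs, the `axis` argument of `np.concatenate` decides the direction of the join. The small arrays `a` and `b` below are made up for illustration.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

rows = np.concatenate((a, b), axis=0)  # b is placed below a -> shape (4, 2)
cols = np.concatenate((a, b), axis=1)  # b is placed to the right of a -> shape (2, 4)
print(rows)
print(cols)
```

For 2-D arrays like these, `axis=0` gives the same result as `np.vstack` and `axis=1` the same result as `np.hstack`.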


In [33]:
# The elements of an array must have a type that supports the
# mathematical operation being applied

data = np.array([1, 2, 'cat', 4])  # the string forces every element to become a string
print(data)
print(data.dtype)
print(data + 1)  # raises an error: an integer cannot be added to strings

['1' '2' 'cat' '4']
<U21


UFuncTypeError: ignored
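
The error above can be caught, and strings that really are numbers can be converted back. This is a minimal sketch: it relies on `UFuncTypeError` being a subclass of the built-in `TypeError`, and the variable names are illustrative.

```python
import numpy as np

mixed = np.array([1, 2, 'cat', 4])
print(mixed.dtype.kind)          # 'U' means the elements are Unicode strings

error_name = None
try:
    mixed + 1
except TypeError as err:         # catches UFuncTypeError as well
    error_name = type(err).__name__
    print('Cannot add:', error_name)

# Conversion works only if every string represents a number
numeric = np.array(['1', '2', '4']).astype(int)
print(numeric + 1)
```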

### Creating arrays with 2 axes:


In [34]:
# This list has two dimensions
list3 = [[1, 2, 3],
         [4, 5, 6]]
list3 # nested list

[[1, 2, 3], [4, 5, 6]]

In [35]:
# data = np.array([[1, 2, 3], [4, 5, 6]])
data = np.array(list3)
data

array([[1, 2, 3],
       [4, 5, 6]])

# Attributes of a multidim array

In [36]:
print('Dimensions:',data.ndim)
print('Shape:', data.shape)  # 2 rows, 3 columns
print('Size:', data.size)

Dimensions: 2
Shape: (2, 3)
Size: 6
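
The attributes relate to each other in a simple way, which also explains the `len()` results seen earlier for stacked arrays. A short sketch:

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6]])
n_rows, n_cols = data.shape

print(len(data) == n_rows)             # len() counts along the first axis only
print(data.size == n_rows * n_cols)    # size counts every element
print(data.ndim == len(data.shape))    # one shape entry per axis
```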


In [37]:
# You can transpose an array (matrix) with either np.transpose(arr)
# or arr.T
print('Data:')
print(data)
print('Transpose:')  # rows become columns
print(data.T)
print('In-place or not?')
print(data) # Not in-place

# print(list3.T)  # this would not work: Python lists have no .T attribute

Data:
[[1 2 3]
 [4 5 6]]
Transpose:
[[1 4]
 [2 5]
 [3 6]]
In-place or not?
[[1 2 3]
 [4 5 6]]
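
Although `data` itself keeps its shape, the transpose is a view that shares memory with the original, so writing into the transpose also changes `data`. The sketch below checks this with `np.shares_memory`; the variable `t` is illustrative.

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6]])
t = data.T

print(np.array_equal(t, np.transpose(data)))  # both spellings give the same result
print(np.shares_memory(data, t))              # the transpose is a view, not a copy

t[0, 1] = 40   # element (0, 1) of the transpose is element (1, 0) of data
print(data)
```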


# Other ways to create NumPy arrays

In [38]:
# np.arange() is similar to the built-in range()
# Creates array with a range of consecutive numbers
# starts at 0 and step=1 if not specified. Exclusive of stop.

np.arange(12)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [39]:
# Array increasing from start to end: np.arange(start, end)
np.arange(10, 20)

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [40]:
# Array increasing from start to end by step: np.arange(start, end, step)
# The range always includes start but excludes end
np.arange(1, 10, 4)  # 1, 1+4=5, 1+4+4=9; the next value, 13, falls outside the half-open interval [1, 10)

array([1, 5, 9])
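
For integer arguments and a positive step, the number of elements can be worked out in advance as ceil((end - start) / step). The sketch below checks that arithmetic for the example above; `expected_len` is an illustrative name.

```python
import math
import numpy as np

start, stop, step = 1, 10, 4
expected_len = math.ceil((stop - start) / step)   # ceil(9 / 4) = 3

values = np.arange(start, stop, step)
print(values, len(values) == expected_len)
```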

In [44]:
np.arange?

In [41]:
# Returns a new array of specified size, filled with zeros.
array = np.zeros((2, 5), dtype=np.int8)  # zero-filled arrays are often used as placeholders, e.g. in machine learning
array

array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]], dtype=int8)

In [42]:
# Returns a new array of specified size, filled with ones.
array = np.ones((2, 5), dtype=np.float64)
array

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])
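
The `dtype` argument also determines how much memory an array uses. As a small sketch, `itemsize` gives the bytes per element and `nbytes` the total for the two arrays created above (the names `small` and `large` are illustrative).

```python
import numpy as np

small = np.zeros((2, 5), dtype=np.int8)
large = np.ones((2, 5), dtype=np.float64)

print(small.itemsize, small.nbytes)   # 1 byte per element, 10 bytes in total
print(large.itemsize, large.nbytes)   # 8 bytes per element, 80 bytes in total
```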

In [43]:
# Returns the square identity matrix of the given size (here 4 x 4)
array = np.eye(4)
array

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

## Some useful indexing strategies

### There are two main types of indexing: Integer and Boolean

In [46]:
x = np.array([[1, 2], [3, 4], [5, 6]])
x

array([[1, 2],
       [3, 4],
       [5, 6]])

#### Integer indexing

In [47]:
# The first index is the row, the second index is the column
print(x[1,0])

3


In [48]:
print(x[1:, :])  # all rows after the first, all columns: x[rows, columns]

[[3 4]
 [5 6]]


In [49]:
# The first list contains row indices, the second list contains column indices
idx = x[[0, 1, 2], [0, 1, 1]]  # picks the elements at (0, 0), (1, 1) and (2, 1)
print(idx)

[1 4 6]
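
Indexing with two lists pairs the indices up position by position. The sketch below reproduces the result with an explicit loop over the (row, column) pairs; `picked` and `by_hand` are illustrative names.

```python
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6]])
rows = [0, 1, 2]
cols = [0, 1, 1]

picked = x[rows, cols]
# Equivalent loop: take one element per (row, column) pair
by_hand = [int(x[r, c]) for r, c in zip(rows, cols)]
print(picked.tolist(), by_hand)
```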


### Boolean indexing

In [50]:
print('Comparison operator, find all values greater than 3:\n')
print(x)
print(x>3)

Comparison operator, find all values greater than 3:

[[1 2]
 [3 4]
 [5 6]]
[[False False]
 [False  True]
 [ True  True]]


In [51]:
print('Boolean indexing, only extract elements greater than 3:\n')
print(x[x>3])  # this pattern is also important in pandas

Boolean indexing, only extract elements greater than 3:

[4 5 6]
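
Boolean masks can be combined and can also be used for assignment. A minimal sketch, with the conditions chosen only for illustration:

```python
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6]])

# Combine conditions with & (and) or | (or); each condition needs parentheses
mask = (x > 2) & (x < 6)
print(x[mask])

y = x.copy()
y[y % 2 == 0] = 0    # set every even element to 0 through a mask
print(y)
```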


## Extra NumPy array methods

In [52]:
# Reshape is used to change the shape
a = np.arange(0, 15)

print('Original:',a)
a = a.reshape(3, 5)
# a = np.arange(0, 15).reshape(3, 5)  # same thing

print('Reshaped:')
print(a)

Original: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
Reshaped:
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
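
When reshaping, one dimension can be given as `-1` so that NumPy infers it from the array's size. The total number of elements must still match, as this sketch shows:

```python
import numpy as np

a = np.arange(15)
b = a.reshape(3, -1)     # -1 is inferred as 15 / 3 = 5
print(b.shape)

try:
    a.reshape(4, -1)     # 15 elements cannot be split into 4 equal rows
except ValueError as err:
    print('Reshape failed:', err)
```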


In [53]:
# The sum, min, max, etc. are also easy to find
print(a)

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]


In [54]:
print('Sum:', a.sum())
print('Min:', a.min())
print('Max:', a.max())

Sum: 105
Min: 0
Max: 14


In [55]:
print('Sum along columns:',a.sum(axis=0))
print('Sum along rows:',a.sum(axis=1))

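# Note: axis specifies which dimension is "collapsed". axis=0 sums down the rows
# (one result per column); axis=1 sums across the columns (one result per row)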

Sum along columns: [15 18 21 24 27]
Sum along rows: [10 35 60]
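
Looking at the shapes makes the "collapse" visible: the summed axis disappears from the result. The sketch below also uses the `keepdims=True` option, which keeps the collapsed axis with length 1.

```python
import numpy as np

a = np.arange(15).reshape(3, 5)             # shape (3, 5)

col_sums = a.sum(axis=0)                    # axis 0 collapsed -> shape (5,)
row_sums = a.sum(axis=1)                    # axis 1 collapsed -> shape (3,)
row_sums_2d = a.sum(axis=1, keepdims=True)  # axis 1 kept with length 1 -> shape (3, 1)

print(col_sums.shape, row_sums.shape, row_sums_2d.shape)
```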


# Random numbers

In [56]:
# Random numbers
np.random.seed(0)  # set the seed to zero for reproducibility
print(np.random.uniform(1,5,10))   # 10 random uniform numbers from 1 to 5
print()
print(np.random.exponential(1,5))  # 5 random exponential numbers with scale 1 (rate = 1/scale = 1)

[3.19525402 3.86075747 3.4110535  3.17953273 2.6946192  3.58357645
 2.75034885 4.567092   4.85465104 2.53376608]

[1.56889614 0.75267411 0.83943285 2.59825415 0.07368535]


In [57]:
print(np.random.random(8).reshape(2,4))  # 8 random numbers in [0, 1) in a 2 x 4 array

[[0.0871293  0.0202184  0.83261985 0.77815675]
 [0.87001215 0.97861834 0.79915856 0.46147936]]
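
As an alternative to the global `np.random.seed`, NumPy (version 1.17 and later) also offers a generator object created with `np.random.default_rng`. The sketch below repeats the draws above with that interface. It produces different numbers from the legacy functions, so the outputs shown earlier will not be reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)   # a generator with its own seed

u = rng.uniform(1, 5, 10)        # 10 uniform numbers in [1, 5)
e = rng.exponential(1, 5)        # 5 exponential numbers with scale 1
r = rng.random((2, 4))           # 2 x 4 array of numbers in [0, 1)

print(u)
print(e)
print(r)
```

Creating a second generator with the same seed and making the same calls in the same order yields the same numbers, which is what makes results reproducible.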


If you want to learn more about "random" numbers in NumPy go to: https://docs.scipy.org/doc/numpy-1.12.0/reference/routines.random.html

# Acknowledgements

- The code in this notebook is modified from various sources, including Dr. Xinzhi Zhang's Jupyter notebooks.
- All code is for educational purposes only and released under the CC1.0.