
# **NumPy: Numeric Computing Library**

## 1) NumPy: Low-Level Basis, Binary Numbers, Memory Footprint, And Why NumPy Computes Faster Than Vanilla Python

In [3]:
import sys
import numpy as np

In [4]:
a = np.array([1, 2, 3, 4])
b = np.array([0, .5, 1, 1.5, 2])
print(a[0], a[1])
print(a[0:], a[1:3])
print(b[0], b[2], b[-1])
print(a[::2])
print(b[[0, 2, -1]])

1 2
[1 2 3 4] [2 3]
0.0 1.0 2.0
[1 3]
[0. 1. 2.]


In [5]:
A = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print(A.shape, A.ndim)
print(A.size)
print(A[0,2]) # Multi-Indexing

(2, 3) 2
6
3


In [6]:
# Three-Dimensional Array -> List Of Lists Of Lists
B = np.array([
    [[12, 11, 10],
     [9, 8, 7]],
    [[6, 5, 4],
     [3, 2, 1]]
    ])
print(B.shape)
print(B.ndim, B.size)

(2, 2, 3)
3 12


In [7]:
print(np.array([1, 2, 3, 4], dtype=np.float64))
print(np.array([1, 2, 3, 4], dtype=np.int8))
c = np.array(['a', 'b', 'c'])
d = np.array([{'a': 1}, sys])
print(d.dtype) 
print(c.dtype)

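# Rows Of Unequal Length Cannot Form A Regular N-Dimensional Array. With dtype=object,
# NumPy Stores The Top-Level Items As Python Objects (Recent Versions Raise An Error Without It)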
C = np.array([
    [[12, 11, 10],
     [9, 8, 7]],
    [[6, 5, 4]]
], dtype = object)
print(C.dtype, type(C))
print(C.shape, C.size)
print(type(C[0]))

[1. 2. 3. 4.]
[1 2 3 4]
object
<U1
object <class 'numpy.ndarray'>
(2,) 2
<class 'list'>


## **Indexing and Slicing of N-Dimensional Matrices**

In [8]:
# Square Matrix
A = np.array([
#    0. 1. 2
    [1, 2, 3], # 0
    [4, 5, 6], # 1
    [7, 8, 9]  # 2
])
print(A[1, 0], A[0:2])
print(A[:, :2], A[:2, :2])
print(A[:2, 2:])
# Element Re-Assignment
A[1] = np.array([10, 10, 10])
A[2] = 99 # Every Element Of The Third Row (Index 2) Is Set To 99
print(A)

4 [[1 2 3]
 [4 5 6]]
[[1 2]
 [4 5]
 [7 8]] [[1 2]
 [4 5]]
[[3]
 [6]]
[[ 1  2  3]
 [10 10 10]
 [99 99 99]]


## **Summary Statistical Observations**

In [9]:
# One-Dimensional Array
a = np.array([1, 2, 3, 4])
print(a.sum())
print(a.mean())
print(a.std()) # Standard Deviation (Population, ddof=0)
print(a.var()) # Variance (Population, ddof=0)
# Two-Dimensional Array
A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
# axis=0 Aggregates Down The Rows, Giving One Result Per Column (e.g. [1, 4, 7] -> 12)
# axis=1 Aggregates Across The Columns, Giving One Result Per Row (e.g. [1, 2, 3] -> 6)
# Sum, Mean And Standard Deviation Along Each Axis
print(A.sum(axis=0))
print(A.sum(axis=1))
print(A.mean(axis=0))
print(A.mean(axis=1))
print(A.std(axis=1))
print(A.std(axis=0))

10
2.5
1.118033988749895
1.25
[12 15 18]
[ 6 15 24]
[4. 5. 6.]
[2. 5. 8.]
[0.81649658 0.81649658 0.81649658]
[2.44948974 2.44948974 2.44948974]
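
The values printed by `std()` and `var()` above follow from NumPy's default of dividing by `n` (the population formula). The sketch below recomputes the variance by hand and contrasts it with the sample variance obtained through `ddof=1`; the variable names are illustrative only.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# Population variance (NumPy's default, ddof=0): mean of the squared deviations.
manual_var = ((a - a.mean()) ** 2).sum() / a.size
manual_std = np.sqrt(manual_var)

# Sample variance divides by (n - 1) instead; request it with ddof=1.
sample_var = a.var(ddof=1)

print(manual_var, a.var())   # both are the population variance
print(manual_std, a.std())   # both are the population standard deviation
print(sample_var)            # larger, because the divisor is n - 1
```

If the data is a sample rather than the whole population, passing `ddof=1` to `std()` or `var()` is usually the intended choice.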


## **Broadcasting And Vectorized Operations (Self-Reminder: Extensively Used To Modify / Perform Arithmetic Operations On Boolean Arrays)**

The term `broadcasting` describes how `NumPy` treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is `“broadcast”` across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in `C` instead of `Python`. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to `inefficient use of memory that slows computation`.

**Source:** https://numpy.org/doc/stable/user/basics.broadcasting.html
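
As a minimal sketch of the rule quoted above, the example below combines arrays of different shapes. The arrays `matrix`, `row`, and `column` are made up for illustration and are not part of the notebook.

```python
import numpy as np

# A (3, 3) matrix and a (3,) vector: the vector is broadcast across every row.
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
row = np.array([10, 20, 30])
row_result = matrix + row              # result shape: (3, 3)

# A (3, 1) column and a (3,) row broadcast against each other to (3, 3).
column = np.array([[0], [100], [200]])
grid = column + row                    # result shape: (3, 3)

# Shapes whose trailing dimensions differ (and are not 1) cannot be broadcast.
try:
    matrix + np.array([1, 2])          # (3, 3) with (2,)
    shapes_compatible = True
except ValueError:
    shapes_compatible = False

print(row_result)
print(grid)
print(shapes_compatible)
```

Shapes are compared from the trailing dimension backwards; two dimensions are compatible when they are equal or when one of them is 1.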

In [10]:
a = np.arange(4) # arange() Builds An Array From A Numerical Range
print(a*10, a+10)
a += 10
print(a)
a *= 10
print(a, "\n")

# Element Wise Operations On Two Arrays
b = np.array([10, 10, 10, 10])
print(a + b)
print(a * b)

[ 0 10 20 30] [10 11 12 13]
[10 11 12 13]
[100 110 120 130] 

[110 120 130 140]
[1000 1100 1200 1300]


## **Boolean Arrays (Also Called Masks): Result Of Broadcasting Boolean Operations**

In [11]:
a = np.arange(4)
print(a)

[0 1 2 3]


In [12]:
print(a[0], a[-1])
print(a[[0, -1]])

0 3
[0 3]


In [13]:
a[[True, False, False, True]] # Selection Of Elements Based On Truth Values Of The Filter/Mask Array
# Self-Reminder: Relate This Concept To Subnet Masks In Networking For IPV4 Addresses

array([0, 3])

In [14]:
a >= 2 #Return Value Is A Mask

array([False, False,  True,  True])

In [15]:
# Filter Elements Based On Conditions
B = a[a > a.mean()]
C = a[~(a > a.mean())]
# Array 'C' Is The Negation Of Array 'B'
print(B,C)
# Examples Of Filtering Data Using Bit-Wise Operators
print(a[(a == 0) | (a == 1)])
print(a[(a <= 2) & (a % 2 == 0)])

[2 3] [0 1]
[0 1]
[0 2]


In [16]:
A = np.random.randint(100, size=(3, 3))

In [18]:
print(A)
A[np.array([
    [True, False, True],
    [False, True, False],
    [True, False, True]
])] # Hand-Written Mask: Not A Scalable Approach

[[88 37 78]
 [67 65 25]
 [42 65 26]]


array([88, 78, 65, 42, 26])

In [19]:
# Scalable Approaches
print(A > 30) # Returns A 2-D Boolean Mask With The Same Shape As A
print(A[A > 30]) # Returns A 1-D Array Of The Elements Where The Mask Is True

[[ True  True  True]
 [ True  True False]
 [ True  True False]]
[88 37 78 67 65 42 65]
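
Masks can also be used to modify or count elements, not only to select them. The sketch below hard-codes the random matrix printed above so that it is reproducible; the names `capped`, `flags`, and `count` are illustrative.

```python
import numpy as np

# Fixed copy of the random 3x3 matrix shown in the output above.
A = np.array([[88, 37, 78],
              [67, 65, 25],
              [42, 65, 26]])

# Assignment through a mask: cap every value above 30 at 30.
capped = A.copy()
capped[capped > 30] = 30

# np.where picks between two values element-wise based on the mask.
flags = np.where(A > 30, 1, 0)

# True counts as 1, so summing a mask counts the matching elements.
count = int((A > 30).sum())

print(capped)
print(flags)
print(count)
```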


## **Linear Algebra** 

In [20]:
A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
B = np.array([
    [6, 5],
    [4, 3],
    [2, 1]
])

In [21]:
print(A.dot(B)) # Dot Product
print(A @ B) # Dot Product / Matrix Multiplication 
print(B.T) # B Transpose
print(B.T @ A) # B^T * A

[[20 14]
 [56 41]
 [92 68]]
[[20 14]
 [56 41]
 [92 68]]
[[6 4 2]
 [5 3 1]]
[[36 48 60]
 [24 33 42]]
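
A short sketch of the shape rule behind these products, reusing the same `A` and `B`: an `(m, n)` matrix can only be multiplied by an `(n, p)` matrix, and the transpose of a product equals the product of the transposes in reverse order.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
B = np.array([[6, 5],
              [4, 3],
              [2, 1]])

# (3, 3) @ (3, 2) -> (3, 2): the inner dimensions (3 and 3) match.
product = A @ B

# (A @ B)^T equals B^T @ A^T.
transpose_identity = np.array_equal(product.T, B.T @ A.T)

# (3, 2) @ (3, 3) fails: the inner dimensions (2 and 3) differ.
try:
    B @ A
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(product.shape, transpose_identity, mismatch_raised)
```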


## **Size Of Objects In Memory**

In [24]:
# A Small Python Integer Typically Takes 24 Or 28 Bytes, Depending On The Build
print(sys.getsizeof(1))
# Python Integers Grow With Magnitude: 10**100 Takes 72 Bytes Here
print(sys.getsizeof(10**100), "\n")
# NumPy Integer Sizes Are Much Smaller And Fixed By The dtype
print(np.dtype(int).itemsize)
print(np.dtype(np.int8).itemsize)
print(np.dtype(float).itemsize, "\n")
# Array Data Versus List Container
print(np.array([1]).nbytes) # Data Bytes Of A One-Element NumPy Array
print(sys.getsizeof([1])) # One-Element List (Container Only, Excludes The int Object Itself)

28
72 

8
1
8 

8
64
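
The one-element comparison understates the gap, because `sys.getsizeof` on a list does not count the integer objects the list points to. The sketch below gives a rough estimate for 1000 elements. It assumes CPython, and it slightly overcounts the list side because CPython shares small integer objects.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# The list object itself: header plus one pointer per element.
list_container = sys.getsizeof(py_list)
# The int objects referenced by the list (approximate: small ints are shared).
list_elements = sum(sys.getsizeof(x) for x in py_list)
list_total = list_container + list_elements

# The array stores raw 8-byte integers in one contiguous buffer.
array_data = np_array.nbytes

print(list_total, array_data)
```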


## **Performance Testing / Timing Comparison**

In [35]:
# Sum Of Squares Of The First 100000 Non-Negative Integers
l = list(range(100000))
a = np.arange(100000)
%time np.sum(a ** 2) # Operation In NumPy 
print("\n")
%time sum([x ** 2 for x in l]) # Same Operation In Vanilla Python

CPU times: user 427 µs, sys: 52 µs, total: 479 µs
Wall time: 490 µs


CPU times: user 27.8 ms, sys: 2.92 ms, total: 30.7 ms
Wall time: 30.7 ms


333328333350000
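
`%time` only works inside IPython or Jupyter and measures a single run. A portable sketch using the standard-library `timeit` module is shown below. The function names are illustrative, and the `dtype=np.int64` is an assumption added so the squares cannot overflow on platforms where the default integer is 32 bits wide.

```python
import timeit
import numpy as np

n = 100_000
values_list = list(range(n))
values_array = np.arange(n, dtype=np.int64)

def numpy_sum_of_squares():
    # Vectorized: squaring and summing both happen in compiled code.
    return int(np.sum(values_array ** 2))

def python_sum_of_squares():
    # Pure Python: one interpreter-level multiplication per element.
    return sum(x ** 2 for x in values_list)

runs = 5
# Take the best of three repeats and average over the runs in each repeat.
numpy_seconds = min(timeit.repeat(numpy_sum_of_squares, number=runs, repeat=3)) / runs
python_seconds = min(timeit.repeat(python_sum_of_squares, number=runs, repeat=3)) / runs

print(f"NumPy:  {numpy_seconds * 1e6:.0f} µs per call")
print(f"Python: {python_seconds * 1e6:.0f} µs per call")
```

Absolute timings depend on the machine, so only the ratio between the two numbers is meaningful.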

## **NumPy Useful Functions List**

In [33]:
# random()
print(np.random.random(size=2)*100) # * 100 To Generate Numbers In The Range [0, 100)
print(np.random.normal(size=2))
print(np.random.rand(2, 4)*100)

[85.04515568  9.1475516 ]
[1.37762927 1.82919128]
[[62.14021412 23.22574133 77.29187455  8.94887569]
 [96.42623402 34.12753564 10.3169958  82.97871463]]


In [34]:
# arange()
print(np.arange(10))
print(np.arange(5, 10))
print(np.arange(0, 1, .1))

[0 1 2 3 4 5 6 7 8 9]
[5 6 7 8 9]
[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]


In [36]:
# reshape
print(np.arange(10).reshape(2, 5))
print(np.arange(10).reshape(5, 2))

[[0 1 2 3 4]
 [5 6 7 8 9]]
[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]


## **linspace()** `endpoint:` `bool, optional`
If `True`, `stop` is the last sample. Otherwise, it is not included. Default value is `True`.

In [38]:
# linspace()
print(np.linspace(0, 1, 5)) # endpoint = True (by Default)
print(np.linspace(0, 1, 20)) # endpoint = True (by Default)
print(np.linspace(0, 1, 20, False)) # endpoint = False

[0.   0.25 0.5  0.75 1.  ]
[0.         0.05263158 0.10526316 0.15789474 0.21052632 0.26315789
 0.31578947 0.36842105 0.42105263 0.47368421 0.52631579 0.57894737
 0.63157895 0.68421053 0.73684211 0.78947368 0.84210526 0.89473684
 0.94736842 1.        ]
[0.   0.05 0.1  0.15 0.2  0.25 0.3  0.35 0.4  0.45 0.5  0.55 0.6  0.65
 0.7  0.75 0.8  0.85 0.9  0.95]
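
The effect of `endpoint` on the spacing can be checked directly: with `retstep=True`, `linspace()` also returns the step it used. A minimal sketch with illustrative variable names:

```python
import numpy as np

start, stop, num = 0.0, 1.0, 5

# endpoint=True (default): step = (stop - start) / (num - 1)
with_end, step_with_end = np.linspace(start, stop, num, retstep=True)

# endpoint=False: step = (stop - start) / num, and stop is excluded
without_end, step_without_end = np.linspace(
    start, stop, num, endpoint=False, retstep=True
)

print(with_end, step_with_end)
print(without_end, step_without_end)
```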


In [43]:
# zeros(), ones() and empty()
print(np.zeros(5))
print(np.zeros((3, 3)))
print(np.zeros((3, 3), dtype=np.int8))
print(np.ones(5))
print(np.ones((3, 3)))
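print(np.empty(5)) # empty() Does Not Initialize Memory: The Values Are Arbitrary
print(np.empty((2, 2))) # Here It Happens To Show Leftover Data From Earlier Arrays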

[0. 0. 0. 0. 0.]
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
[[0 0 0]
 [0 0 0]
 [0 0 0]]
[1. 1. 1. 1. 1.]
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
[1. 1. 1. 1. 1.]
[[0.25 0.5 ]
 [0.75 1.  ]]


## **identity() and eye():** `k(offset):` `int, optional`
Index of the diagonal: `0` (the default) refers to the main diagonal, a `positive value` refers to an `upper diagonal`, and a `negative value` to a `lower diagonal`.

In [48]:
# identity() and eye()
print(np.identity(3))
print(np.eye(3, 3))
print(np.eye(8, 4))
print(np.eye(8, 4, k = 1)) 
print(np.eye(8, 4, k = -3))

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
[[0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 0.]]
