# Introduction to Numpy
### By: Hari Patchigolla
NumPy is the fundamental package for scientific computing in Python. At its core, NumPy gets its fame within the data science community from the ```ndarray``` data structure. But why is the ```ndarray``` more powerful than the built-in Python ```list```? (Note: Throughout this notebook I go over a lot of functions. If you click a function's header, it will redirect you to the official documentation for that function.)

![image.png](attachment:824c3481-2d72-4f2a-8e67-6e141e6f3e53.png)

In [None]:
import numpy as np

![image.png](attachment:999e7e3d-12cd-4d0b-aa02-7bf2440f89a4.png)

In [None]:
import sys
lst = list(range(100000))
arr = np.array(lst)
print(sys.getsizeof(arr))
print(sys.getsizeof(lst))
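
To look at the memory difference a bit more closely, here is a minimal sketch. It assumes that `sys.getsizeof` on a list only measures the list container and its pointers, not the integer objects it points to, so the elements are added up by hand as a rough estimate. The names `py_list` and `np_arr` are illustrative and not part of the original notebook.

```python
import sys
import numpy as np

py_list = list(range(1000))
np_arr = np.array(py_list, dtype=np.int64)

# A list stores pointers to separate Python int objects, so a rough
# estimate of its footprint is the container plus every element.
list_total = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)

# An ndarray stores raw fixed-width values in one contiguous buffer.
arr_data = np_arr.nbytes  # itemsize * number of elements

print("list (container + elements):", list_total, "bytes")
print("ndarray data buffer:", arr_data, "bytes")
print("bytes per ndarray element:", np_arr.itemsize)
```

The exact list numbers vary between Python versions, but the array's data buffer is always `itemsize` times the number of elements.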

# Arrays (Basic)

In this notebook I will cover the following core topics. As you go through the notebook you will learn a lot of different numpy functions and how powerful numpy really is!

- Creating arrays
- Indexing/slicing
- Array Manipulation
- MATH
    - Vectorization/Broadcasting

# Creating Arrays

Since the ```ndarray``` is the core of numpy, learning the different ways to declare a numpy array already puts you in a great position to master the numpy package.

#### [```np.empty()```](https://numpy.org/doc/stable/reference/generated/numpy.empty.html)

You might never actually use this function, but it helps you understand what numpy is doing. The first parameter is the shape of the array you want, given as an ```int``` or as a ```list```/```tuple``` of ints. This function allocates the space for the array but does not initialize the values, so you see whatever happened to be in that memory already. The large numbers you see are called [garbage values](https://unacademy.com/content/question-answer/gk/what-is-garbage-value/#:~:text=Answer%3A%20The%20Garbage%20value%20is,value%20is%20given%20to%20it.). 

In [None]:
np.empty([14], dtype=int)

In [None]:
np.empty([3, 5], dtype=int)

#### [```np.zeros()```](https://numpy.org/doc/stable/reference/generated/numpy.zeros.html)

This function does something very similar to ```np.empty```, but it initializes all the values to 0. Here the output shape of the array is passed in as a ```tuple``` (a ```list``` works as well).

In [None]:
arr = np.zeros(shape=(3,4), dtype='int64')
arr

#### ```np.ones()```

This function does something very similar to ```np.empty```, but it initializes all the values to 1. To specify the output shape of the array you pass in a ```tuple```, just like with ```np.zeros```.

In [None]:
arr = np.ones(shape=(3,4), dtype='int64')
arr

#### [```np.full()```](https://numpy.org/doc/stable/reference/generated/numpy.full.html)

This function creates an array of size ```shape``` and initializes each value to ```fill_value```

In [None]:
arr = np.full(shape=(6, 10), fill_value=10)
arr

#### [```np.array()```](https://numpy.org/doc/stable/reference/generated/numpy.array.html)

Now we are getting to a core function. This function allows us to actually make a numpy array and initialize it with any values we want. A simple way to do this is to pass in a list.

In [None]:
arr = np.array([0,1,2,3,4])
arr

You can also create an array using the [```range```](https://pynative.com/python-range-function/#:~:text=Python%20range()%20function%20generates,iterate%20using%20a%20for%20loop.) object

In [None]:
np.array(range(5))

When you pass in Python integers, the array gets numpy's default integer dtype, which depends on your platform: typically ```int64``` on Linux and macOS, and ```int32``` on Windows with NumPy versions before 2.0. ```int32``` means that each element is a 32-bit integer. Note that the [```.dtype```](https://numpy.org/doc/stable/reference/arrays.dtypes.html#:~:text=A%20data%20type%20object%20(an,%2C%20Python%20object%2C%20etc.)) attribute gives you the data type of the array. You can learn more about the data types in numpy [here](https://numpy.org/doc/stable/user/basics.types.html).

In [None]:
arr.dtype

If you want to override the default dtype, you can pass the dtype you want into the ```dtype``` parameter

In [None]:
arr = np.array([0,1,2,3,4], dtype=np.int64)
arr

In [None]:
arr.dtype
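
As a small sketch of how numpy picks a dtype when none is given, and how to convert afterwards with `astype`: the variable names below are illustrative only, and the exact integer dtype printed depends on your platform.

```python
import numpy as np

ints = np.array([1, 2, 3])          # integers -> platform default integer type
floats = np.array([1.0, 2.0, 3.0])  # floats -> float64
mixed = np.array([1, 2.5, 3])       # mixed ints and floats are upcast to float64

print(ints.dtype, floats.dtype, mixed.dtype)

# astype returns a converted copy; the original array is unchanged
as_float32 = ints.astype(np.float32)
print(as_float32.dtype, ints.dtype)
```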

##### 2D arrays

To create a 2D array (with rows and columns) you use the following pattern

In [None]:
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr

In [None]:
arr = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
arr

When working with multidimensional arrays, the [```.shape```](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.shape.html) attribute will return the dimensions of the array.
<br>
<br>
Below, the ```arr``` array has 4 rows and 3 columns

In [None]:
arr.shape

##### 3D arrays

![image.png](attachment:c41e22fc-0845-4fae-9cda-f7ddc652f854.png)
<br>
Image source: https://realpython.com/numpy-array-programming/
<br>
3D arrays are far more complicated to create by manually passing in the values and structuring the input. There are easier ways, but I will not mention them in this course since working with them can get a bit complicated. However, below you will see one way to make a 3D array.

In [None]:
arr = np.array([
    [[0,1,2],[3,4,5],[6,7,8],[9,10,11]],            # first 4x3 block (index 0 along axis 0)
    [[12,13,14],[15,16,17],[18,19,20],[21,22,23]],  # second 4x3 block (index 1 along axis 0)
    [[24,25,26],[27,28,29],[30,31,32],[33,34,35]],  # third 4x3 block (index 2 along axis 0)
])
arr

#### [```np.eye()```](https://numpy.org/doc/stable/reference/generated/numpy.eye.html)

This function returns a 2D array holding the identity matrix (very useful in linear algebra applications)

In [None]:
arr = np.eye(3)
arr
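
To see why the identity matrix matters, here is a minimal sketch showing that multiplying a matrix by it leaves the matrix unchanged. The matrix `m` is made up for illustration.

```python
import numpy as np

m = np.array([[2, 0, 1],
              [1, 3, 2],
              [0, 1, 4]])
identity = np.eye(3, dtype=int)  # dtype=int keeps the result as integers

# Matrix-multiplying by the identity returns the original matrix
product = np.dot(m, identity)
print(np.array_equal(product, m))
```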

#### [```np.arange()```](https://numpy.org/doc/stable/reference/generated/numpy.arange.html)

This function is basically the numpy version of the ```range()``` function in Python. It can take in three values: ```start```, ```stop```, ```step```

In [None]:
arr = np.arange(5)
arr

In [None]:
arr = np.arange(2,10)
arr

In [None]:
arr = np.arange(0,10,2)
arr

#### [```np.linspace(start, stop, num)```](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html)

This function returns an array of evenly spaced elements. More specifically, it returns ```num``` evenly spaced values over the interval from ```start``` to ```stop```, with both endpoints included by default. This function is super useful when making graphs (a future lesson)

In [None]:
arr = np.linspace(0,1,5)
arr
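
The spacing can be checked directly. The sketch below uses the `retstep` and `endpoint` parameters of `np.linspace`; the variable names are illustrative.

```python
import numpy as np

# retstep=True also returns the spacing between neighbouring samples
points, step = np.linspace(0, 1, 5, retstep=True)
print(points)  # 5 samples, both endpoints included
print(step)    # 0.25

# Leaving out the endpoint changes the spacing to 1/5
open_points = np.linspace(0, 1, 5, endpoint=False)
print(open_points)
```

Note that 5 samples with both endpoints included give 4 gaps, which is why the step is 0.25 rather than 0.2.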

#### [```np.random```](https://numpy.org/doc/stable/reference/random/index.html)

The ```np.random``` module has many functions that make it super powerful, but due to time constraints we will not cover them here. I highly encourage you to look through them though. 

In [None]:
np.random.rand(2,3)
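
If you want repeatable random numbers, one option (not used elsewhere in this notebook) is the `Generator` interface created with `np.random.default_rng`. This is a minimal sketch; the seed value and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded generator for reproducible output

uniform = rng.random((2, 3))                     # floats in [0, 1), shape (2, 3)
integers = rng.integers(0, 10, size=5)           # integers from 0 up to, not including, 10
normal = rng.normal(loc=0.0, scale=1.0, size=4)  # samples from a standard normal

print(uniform)
print(integers)
print(normal)
```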

# Indexing/Slicing

Being able to access the different elements within your numpy array will bring you another step closer to being proficient with Numpy

In [None]:
arr = np.array([0,4,6,8,3,4])
arr

For the most part, when working with 1D arrays, indexing and slicing work the same as with regular Python lists

In [None]:
arr[1]

In [None]:
arr[1] + arr[4]

In [None]:
arr[3:6]

##### 2D Array
When indexing 2D arrays you use the following notation ```arr[row,col]```

In [None]:
arr = np.array(
    [
        [2,4,6,4],
        [8,3,0,7],
        [5,7,1,9],
        [2,4,3,8],
        [7,8,3,1],
    ]
)
arr

In [None]:
# for rows
print("Row 1: ", arr[0])
print("Row 2: ", arr[1])
print("Row 3: ", arr[2])

In [None]:
print("Col 1: ", arr[:,0])
print("Col 2: ", arr[:,1])
print("Col 3: ", arr[:,2])

In [None]:
#row,col
arr[1,2]

In [None]:
arr[0:4, 1:3]

In [None]:
x = np.random.randint(5)
y = np.random.randint(4)
print(x, y)
arr[x,y]

##### Boolean Masking

Another commonly used method to index an array is to apply a logical statement to it. Notice the output of the following statement

In [None]:
arr > 3

Every element of ```arr``` that is greater than 3 is replaced with ```True```, and with ```False``` otherwise. This boolean mask can be used to index the array

In [None]:
arr[arr > 3]

Boolean masking can also be used with expressions. The following subtracts 3 from each element of ```arr``` and then creates a boolean mask based on whether or not each resulting value is less than 0.

In [None]:
arr[arr - 3 < 0]
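
Masks can also be combined. The sketch below uses the element-wise operators `&` (and), `|` (or) and `~` (not) on a made-up array called `values`; each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import numpy as np

values = np.array([2, 4, 6, 4, 8, 3, 0, 7])

# Combine conditions element-wise; wrap each condition in parentheses
between = values[(values > 2) & (values < 7)]
outside = values[(values <= 2) | (values >= 7)]
not_even = values[~(values % 2 == 0)]

print(between)   # [4 6 4 3]
print(outside)   # [2 8 0 7]
print(not_even)  # [3 7]
```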

##### Negative Indexing

Negative indexing works similarly in numpy as it does in base Python. A value of ```-1``` accesses the last element along an axis of the array

In [None]:
arr[0, -1]

In [None]:
arr[-1, :]

##### [```np.diag```](https://numpy.org/doc/stable/reference/generated/numpy.diag.html)

This function outputs the diagonal of an array, which once again is heavily used in linear algebra applications. 

In [None]:
np.diag(arr)

##### [```np.where```](https://numpy.org/doc/stable/reference/generated/numpy.where.html)

This function takes in three main inputs. The first input is a boolean mask. Wherever the mask is ```True``` the output takes its value from the second input, and wherever it is ```False``` the output takes its value from the third input.

In [None]:
np.where(arr % 2 == 0, arr * 100000, arr)
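
A smaller example can make the three inputs easier to follow. In this sketch the `scores` array and the pass mark of 50 are invented for illustration; the second half shows the one-argument form, which returns the indices where the mask is `True`.

```python
import numpy as np

scores = np.array([35, 80, 62, 49, 91])

# Where the mask is True take the second argument, otherwise the third
labels = np.where(scores >= 50, "pass", "fail")
print(labels)  # ['fail' 'pass' 'pass' 'fail' 'pass']

# With only a condition, np.where returns a tuple of index arrays
(idx,) = np.where(scores >= 50)
print(idx)  # [1 2 4]
```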

# [Array Manipulation](https://numpy.org/doc/stable/reference/routines.array-manipulation.html)

Now that you know how to create arrays and access different values in an array, the next step is to learn how to manipulate an array. How do you change values in an array, or change attributes of the array in general?

In [None]:
arr = np.array(
    [
        [9,4,6,2,1,5],
        [3,5,2,7,8,7],
        [6,2,9,0,1,3],
        [4,2,6,9,2,8],
        [5,8,3,4,7,2],
        [3,7,2,9,4,6]
    ]
)
arr

In [None]:
arr.shape

## [```.reshape()```](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html)

This method takes in a shape in the form of a tuple and then reshapes the array into that shape. Note that an error is thrown when the desired shape is not possible, that is, when the new shape does not hold the same number of elements as the original array. This is very useful when trying to multiply arrays. 

In [None]:
arr.reshape((18,2))

In [None]:
arr = arr.reshape((18,2))

In [None]:
arr[:,0]

In [None]:
# arr now has shape (18, 2), so the valid column indices are 0 and 1
arr[:,1].reshape((len(arr[:,0]), 1))
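
The sketch below illustrates the two points about `reshape` on a small made-up array: a dimension given as `-1` is inferred by numpy, and a shape with the wrong number of elements raises a `ValueError`.

```python
import numpy as np

flat = np.arange(12)

grid = flat.reshape((3, 4))     # 3 * 4 == 12, so this works
auto = flat.reshape((-1, 6))    # -1 asks numpy to infer that dimension (here 2)
column = flat.reshape((-1, 1))  # a common way to get a column vector

print(grid.shape, auto.shape, column.shape)

try:
    flat.reshape((5, 3))        # 5 * 3 != 12
except ValueError as err:
    print("reshape failed:", err)
```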

## [```.T```](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.T.html#numpy.ndarray.T)

This creates the transpose of the array, once again a very useful feature in linear algebra

In [None]:
arr.T

## [```.ravel```](https://numpy.org/doc/stable/reference/generated/numpy.ravel.html#numpy.ravel)

This function takes an array and returns a 1D array with all the values of the input array.

In [None]:
# arr.reshape(-1) - this does the same thing, do you know why?
np.ravel(arr)
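
As a hint for the question in the comment above, here is a small sketch comparing three ways to flatten an array. `flatten` is not covered elsewhere in this notebook and is included only for contrast; the array `grid` is made up.

```python
import numpy as np

grid = np.arange(6).reshape((2, 3))

raveled = np.ravel(grid)      # returns a view of the same data when possible
reshaped = grid.reshape(-1)   # -1 lets numpy infer the single dimension
flattened = grid.flatten()    # always returns a copy

print(np.array_equal(raveled, reshaped))   # True
print(np.shares_memory(grid, raveled))     # True: no data was copied
print(np.shares_memory(grid, flattened))   # False: independent copy
```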

## [```.column_stack```](https://numpy.org/doc/stable/reference/generated/numpy.column_stack.html#numpy.column_stack)
## [```.row_stack```](https://numpy.org/doc/stable/reference/generated/numpy.row_stack.html#numpy.row_stack)

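These two functions do pretty much what their names imply. Each takes in an iterable of arrays (make sure they are of the same shape) and returns an array with the proper stacking. Note that ```np.row_stack``` is an alias for ```np.vstack``` and is deprecated as of NumPy 2.0, so on recent versions you may prefer ```np.vstack```.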

In [None]:
arr1 = np.array([1,2,3,4,5])
arr2 = np.array([6,7,8,9,10])
arr3 = np.array([11,12,13,14,15])

In [None]:
np.column_stack([arr1, arr2, arr3])

In [None]:
np.row_stack((arr1, arr2, arr3))

In [None]:
arr1 = np.array([[1,2,3,4,5], [6,7,8,9,10]])
arr2 = np.array([[11,12,13,14,15], [16,17,18,19,20]])
arr3 = np.array([[21,22,23,24,25],[26,27,28,29,30]])

In [None]:
np.column_stack((arr1, arr2, arr3))

In [None]:
np.row_stack((arr1, arr2, arr3))
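
Checking the shapes of the results is a quick way to see what each kind of stacking does. This sketch uses two short made-up arrays and `np.vstack`, which stacks rows in the same way as `np.row_stack`.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

cols = np.column_stack((a, b))  # each 1D input becomes a column
rows = np.vstack((a, b))        # each 1D input becomes a row

print(cols.shape)  # (3, 2)
print(rows.shape)  # (2, 3)
print(np.array_equal(cols, rows.T))  # True
```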

## [```.unique```](https://numpy.org/doc/stable/reference/generated/numpy.unique.html#numpy.unique)

Returns all the unique values of an array

In [None]:
np.unique(arr)

## [```.append```](https://numpy.org/doc/stable/reference/generated/numpy.append.html#numpy.append)

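Attaches a new array to the end of an array along a certain axis. If no axis is given, both arrays are flattened first.
<br>
```axis=0``` - appends along the rows (the result has more rows)
<br>
```axis=1``` - appends along the columns (the result has more columns)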

In [None]:
# match arr's current shape so the axis=0 and axis=1 appends below are both valid
arr2 = np.zeros(arr.shape)
arr2

In [None]:
np.append(arr, arr2)

In [None]:
np.append(arr, arr2, axis=0)

In [None]:
np.append(arr, arr2, axis=1)
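
The effect of the `axis` argument is easiest to see from the result shapes. In this sketch, `top` and `extra` are small made-up arrays of the same shape.

```python
import numpy as np

top = np.ones((2, 3), dtype=int)
extra = np.zeros((2, 3), dtype=int)

flat = np.append(top, extra)               # no axis: both inputs are flattened first
more_rows = np.append(top, extra, axis=0)  # stacked below
more_cols = np.append(top, extra, axis=1)  # stacked to the right

print(flat.shape)       # (12,)
print(more_rows.shape)  # (4, 3)
print(more_cols.shape)  # (2, 6)
print(top.shape)        # (2, 3): np.append returns a new array
```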

![image.png](attachment:c7e00140-8817-4914-b4e2-abbea759b618.png)

## [```.split```](https://numpy.org/doc/stable/reference/generated/numpy.split.html)

This function is a neat way to divide/segment your array into a list of smaller arrays

In [None]:
np.split(arr, 2, axis=0)

In [None]:
np.split(arr, 2, axis=1)
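
The second argument of `np.split` can be either a number of equal sections or a list of indices to cut at. A minimal sketch on a made-up 4x3 array:

```python
import numpy as np

grid = np.arange(12).reshape((4, 3))

halves = np.split(grid, 2, axis=0)        # two equal pieces of 2 rows each
pieces = np.split(grid, [1, 3], axis=0)   # cut before row 1 and before row 3

print([h.shape for h in halves])  # [(2, 3), (2, 3)]
print([p.shape for p in pieces])  # [(1, 3), (2, 3), (1, 3)]
```

When an integer is passed and the axis cannot be divided evenly, `np.split` raises an error; `np.array_split` accepts uneven splits.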

## Bool Masking to change values

Lastly, we can use boolean masking to mutate values in an array

In [None]:
arr[arr > 4] = 999999

In [None]:
arr
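
Keep in mind that this kind of assignment changes the array in place. The sketch below uses a made-up `readings` array and keeps a copy to show the difference.

```python
import numpy as np

readings = np.array([3, -1, 7, -4, 5])
original = readings.copy()   # keep a copy, since the assignment works in place

readings[readings < 0] = 0   # overwrite every negative entry
print(readings)  # [3 0 7 0 5]
print(original)  # [ 3 -1  7 -4  5]
```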

# [Math](https://numpy.org/doc/stable/reference/routines.math.html)

As previously mentioned, the primary purpose of Numpy is mathematical computation.
<br>
<br>
Coming from the official documentation:
<br>
```NumPy can be used to perform a wide variety of mathematical operations on arrays. It adds powerful data structures to Python that guarantee efficient calculations with arrays and matrices and it supplies an enormous library of high-level mathematical functions that operate on these arrays and matrices.```
<br>
<br>
So let's see why this is the case

In [None]:
import random
import time


$result = \sum_{i=1}^{1000000} (x_i + 3) + \sum_{i=1}^{1000000} (y_i \cdot 4)$

<br>
<br>

Notice how there is a pretty significant difference in computation time between base Python and Numpy

In [None]:
random.seed(23)
x = random.sample(range(0, 5000000), 1000000)
y = random.sample(range(0, 5000000), 1000000)

start = time.time()
sumx = 0
for i in range(0, len(x)):
  sumx = sumx + x[i] + 3

sumy = 0
for i in range(0, len(y)):
  sumy = sumy + y[i] * 4

res = sumx + sumy

end = time.time()
print("Result: ", res, "And it took: ", (end - start), "s")

In [None]:
start = time.time()
x = np.array(x)
y = np.array(y)

res = np.sum(x + 3, dtype=np.int64) + np.sum(y * 4, dtype=np.int64)

end = time.time()
print("Result: ", res, "And it took: ", (end - start), "s")

![image.png](attachment:9f20cb4f-e6a8-44ab-b2ff-95fb88d62073.png)
<br>
<br>
But why is it so much faster? Well, outside of the reasons I have already mentioned, there is another pretty important reason why. This is due to Numpy vectorization.
<br>
<br>
 
***Due to the way numpy stores arrays in memory, it allows you to operate on a set of values (arrays) all at once as compared to operating on them one at a time!***
 
<br>
<br>

[Numpy Vectorization Article](https://medium.com/@mikeliao/numpy-vectorization-d4adea4fc2a)
<br>
[Vectorization](https://www.intel.com/content/www/us/en/developer/articles/technical/vectorization-a-key-tool-to-improve-performance-on-modern-cpus.html)
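
<br>
As a small illustration of vectorization, and of broadcasting from the topic list at the start of the notebook, here is a minimal sketch. The arrays `prices`, `table` and `offsets` are invented for the example.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])

# Vectorization: one expression is applied to every element, no Python loop
with_tax = prices * 1.25
print(with_tax)

# Broadcasting: the (3,) array is applied to each row of the (2, 3) table
table = np.array([[1, 2, 3],
                  [4, 5, 6]])
offsets = np.array([10, 20, 30])
shifted = table + offsets
print(shifted)
```

Broadcasting works here because the trailing dimensions match (both are 3), so numpy can line up `offsets` with every row of `table`.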

In [21]:
import pandas as pd

Below we will be implementing multiple linear regression using numpy. Understanding the intuition behind this is not important right now, but understanding the way the equations are implemented is important. 

The dataset was downloaded from this link:
https://www.kaggle.com/datasets/mirichoi0218/insurance

![image.png](attachment:0e3a31a7-75a7-4d3f-85d6-84cf2b60effe.png)
<br>
More on Multi Lin Reg: https://www.youtube.com/watch?v=zITIFTsivN8
<br>
Image from: https://www.c-sharpcorner.com/article/multiple-linear-regression/
<br>
<br>
![image.png](attachment:78a92288-90fb-4937-91af-d944783754ad.png) 
<br>
$\beta = [\beta_0, \beta_1, \beta_2, \beta_3, ..., \beta_p]$

In [25]:
data = pd.read_csv('insurance_data.csv').values # don't worry about this for now

In [26]:
data

array([[1.90000000e+01, 0.00000000e+00, 2.79000000e+01, 0.00000000e+00,
        1.00000000e+00, 1.68849240e+04],
       [1.80000000e+01, 1.00000000e+00, 3.37700000e+01, 1.00000000e+00,
        0.00000000e+00, 1.72555230e+03],
       [2.80000000e+01, 1.00000000e+00, 3.30000000e+01, 3.00000000e+00,
        0.00000000e+00, 4.44946200e+03],
       ...,
       [1.80000000e+01, 0.00000000e+00, 3.68500000e+01, 0.00000000e+00,
        0.00000000e+00, 1.62983350e+03],
       [2.10000000e+01, 0.00000000e+00, 2.58000000e+01, 0.00000000e+00,
        0.00000000e+00, 2.00794500e+03],
       [6.10000000e+01, 0.00000000e+00, 2.90700000e+01, 0.00000000e+00,
        1.00000000e+00, 2.91413603e+04]])

In [27]:
data.shape

(1338, 6)

Do you know what these lines are doing? Note that there is nothing I am doing here that I didn't previously cover

In [28]:
bool_mask = np.random.rand(data.shape[0]) < 0.9

In [29]:
bool_mask

array([ True,  True,  True, ...,  True,  True,  True])

In [36]:
sum(bool_mask == False)

120

In [32]:
train = data[bool_mask]
test = data[np.invert(bool_mask)]

In [35]:
len(test)

120
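
If the split above is unclear, the same idea is sketched here on a tiny stand-in dataset. The seeded generator and the `rows` array are assumptions made for the example and are not what the notebook itself uses.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded so the split is repeatable
rows = np.arange(20).reshape((10, 2))   # stand-in for a dataset with 10 rows

mask = rng.random(rows.shape[0]) < 0.9  # True for roughly 90% of the rows
train_rows = rows[mask]                 # rows where the mask is True
test_rows = rows[~mask]                 # ~ flips the mask, same as np.invert

print(len(train_rows), len(test_rows))
```

Every row ends up in exactly one of the two sets, because the second mask is the exact opposite of the first.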

In [37]:
X = train[:, 0:-1]
y = train[:, -1]

[```np.dot()```](https://numpy.org/doc/stable/reference/generated/numpy.dot.html)
<br>
[```np.linalg.inv()```](https://numpy.org/doc/stable/reference/generated/numpy.linalg.inv.html)

In [38]:
X_t = np.transpose(X) # X.T
X_tX = np.dot(X_t,X)
X_ty = np.dot(X_t,y)
beta = np.linalg.inv(X_tX).dot(X_ty)
print(beta)

[  202.87297196  -598.60981852    31.30895365   247.3373096
 23487.51633445]
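
The cell above computes $\beta = (X^T X)^{-1} X^T y$ step by step. To check that this recovers known coefficients, here is a sketch on synthetic data; `X_demo`, `y_demo` and `true_beta` are made up for the example. It also shows `np.linalg.solve`, an alternative that solves the same system without forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.random((50, 3))             # 50 samples, 3 features (synthetic)
true_beta = np.array([2.0, -1.0, 0.5])   # coefficients chosen for the demo
y_demo = np.dot(X_demo, true_beta)       # noise-free targets

X_tX = np.dot(X_demo.T, X_demo)
X_ty = np.dot(X_demo.T, y_demo)

# Same steps as the cell above: beta = (X^T X)^-1 X^T y
beta_inv = np.dot(np.linalg.inv(X_tX), X_ty)

# Solve (X^T X) beta = X^T y directly instead of inverting
beta_solve = np.linalg.solve(X_tX, X_ty)

print(beta_inv)
print(beta_solve)
```

Both approaches should print values very close to `true_beta`, since the targets were generated without noise.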


In [39]:
X_test = test[:, 0:-1]
y_test = test[:, -1]
y_pred = np.dot(X_test, beta)

In [40]:
y_test

array([ 1837.237  , 10797.3362 ,  4949.7587 ,  3393.35635, 23568.272  ,
       16577.7795 , 37165.1638 , 11073.176  , 11082.5772 , 21344.8467 ,
       11381.3254 ,  2257.47525, 10115.00885,  3385.39915, 40720.55105,
        9877.6077 , 21348.706  ,  4830.63   ,  2719.27975,  7419.4779 ,
        7731.4271 , 14901.5167 ,  1980.07   , 14001.2867 ,  1832.094  ,
       12829.4551 , 11837.16   , 17179.522  ,  9617.66245,  4237.12655,
        7749.1564 ,  1737.376  , 42124.5153 ,  3561.8889 ,  9144.565  ,
       12029.2867 , 36085.219  , 15006.57945,  9290.1395 ,  2134.9015 ,
       28950.4692 ,  9788.8659 ,  1149.3959 ,  1769.53165,  1632.03625,
        4347.02335, 13470.86   ,  2643.2685 , 11362.755  ,  8413.46305,
        3994.1778 , 13887.204  , 23807.2406 , 45863.205  , 11552.904  ,
        8428.0693 , 12129.61415,  3736.4647 , 24915.04626, 12949.1554 ,
        8527.532  ,  3410.324  , 22192.43711,  5148.5526 ,  3943.5954 ,
        9863.4718 , 29186.48236,  8604.48365, 43254.41795,  6933

In [41]:
y_pred

array([ 4273.51421824, 11760.42144514,  7930.58372235,  5356.9488075 ,
       33750.79718207, 29739.98318037, 28777.29204513, 12212.48996933,
       12002.72902756,  4594.58263458, 12704.03982157,  4964.63637135,
       11338.88477422,  6175.92550194, 33917.8416967 , 11775.14570106,
       32840.99932409,  6452.57727239,  5098.33122689,  9753.28023237,
        9783.34735468, 14215.25135284,  4739.38240088, 13582.84545482,
        4157.67108973, 27576.62643913, 12555.61800595, 31871.07998641,
       10651.96341181,  6600.45267371,  9847.54047746,  4627.91762252,
       33480.63930205,  6451.58836689, 11099.82332265, 12259.75744164,
       29688.34298135, 30021.9134595 , 10571.03881431,  4955.09618818,
       37113.22913773, 10528.48113114,  4399.7017734 ,  4389.8213576 ,
        4044.17955699,  7261.67477608, 13805.43524493,  5093.46857128,
       12285.75220078,  9980.11508068,  7876.41703334, 13917.51225131,
       34133.38012374, 33657.52778763, 12176.21779573,  9796.47339637,
      

![image.png](attachment:27af246b-d595-4842-b205-472532384576.png)

In [42]:
def mse(observed, pred):
    return np.mean((observed - pred)**2, dtype=np.float32)

In [43]:
mse(y_test, y_pred)

32712568.0

![image.png](attachment:3c4f29c5-9b58-4053-9078-f8cdf686538f.png)

In [44]:
def r_sq(observed, pred):
    num = np.sum((observed - pred)**2)
    denum = np.sum((observed - np.mean(observed))**2)
    return 1 - (num / denum)

In [45]:
r_sq(y_test, y_pred)

0.7246195828259576
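
To build some intuition for these two metrics, the sketch below evaluates them on two extreme cases: a perfect prediction and a baseline that always predicts the mean. The `_demo` functions mirror the definitions above under different names, and the data is invented.

```python
import numpy as np

def mse_demo(observed, pred):
    # mean of the squared errors
    return np.mean((observed - pred) ** 2)

def r_sq_demo(observed, pred):
    residual = np.sum((observed - pred) ** 2)
    total = np.sum((observed - np.mean(observed)) ** 2)
    return 1 - residual / total

observed = np.array([1.0, 2.0, 3.0, 4.0])
perfect = observed.copy()
baseline = np.full(observed.shape, observed.mean())  # always predict the mean

print(mse_demo(observed, perfect), r_sq_demo(observed, perfect))    # 0.0 1.0
print(mse_demo(observed, baseline), r_sq_demo(observed, baseline))  # 1.25 0.0
```

An R squared of 1 means the predictions match exactly, while 0 means the model does no better than predicting the mean.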