## Why NumPy

In [week2]() we learnt about Python's built-in container data structures: lists, dictionaries, sets, and tuples. Specifically, we learnt that:

- Lists are mutable and can contain values of various data types, including other lists (nested lists).

This means that if we want to represent a matrix (rows and columns), we can simply use nested lists.

However,
- Lists don’t support mathematical operations like element-wise addition and multiplication.

- Lists can contain objects of differing types, which means that the type of every element has to be checked during computation, and this can seriously slow down code execution.

`NumPy` is a Python library developed for fast and efficient numerical and scientific computation. It provides a data structure optimized for high-performance vector and matrix computation, which significantly speeds up execution.

- NumPy provides a faster way of performing numerical computation.
  - Because NumPy is optimized for matrix and algebraic operations and is implemented in C, it runs faster than a pure Python implementation of the same operations.
- It integrates very well with other popular libraries such as SciPy, matplotlib, and pandas.
- It has a whole range of optimized built-in functions, such as linear algebra operations.
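
The contrast between lists and arrays is easy to see with a small, made-up example. The sketch below (all variable names are illustrative and not part of the course material) shows that `+` and `*` on lists concatenate and repeat, while the same operators on NumPy arrays work element by element.

```python
import numpy as np

# Plain Python lists: + concatenates and * repeats
a_list = [1, 2, 3]
b_list = [10, 20, 30]
list_sum = a_list + b_list      # [1, 2, 3, 10, 20, 30]
list_rep = a_list * 2           # [1, 2, 3, 1, 2, 3]

# Element-wise addition on lists needs an explicit loop or comprehension
loop_sum = [x + y for x, y in zip(a_list, b_list)]   # [11, 22, 33]

# NumPy arrays: the same operators act element-wise
a_arr = np.array(a_list)
b_arr = np.array(b_list)
arr_sum = a_arr + b_arr         # array([11, 22, 33])
arr_prod = a_arr * b_arr        # array([10, 40, 90])

print(list_sum, list_rep, loop_sum)
print(arr_sum, arr_prod)
```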



## NumPy data structure

What is a NumPy array?

- A NumPy array is a multidimensional, grid-like container of homogeneous values, i.e., all of the elements must be of the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array.
- Technically, it is an object that points to a block of memory and keeps track of the element type, dimensions, length, and strides (the number of bytes to step in each dimension when traversing the array). Read [here](https://www.jessicayung.com/numpy-arrays-memory-and-strides/) for more.
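
These properties can be inspected directly on any array. The following is a minimal sketch on a hypothetical 2 x 3 array; the `int32` type is chosen here only so that the stride values are easy to verify by hand (4 bytes per element).

```python
import numpy as np

# A hypothetical 2 x 3 array of 32-bit integers (4 bytes per element)
arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

print(arr.shape)     # (2, 3): 2 rows, 3 columns
print(arr.dtype)     # int32
print(arr.ndim)      # 2 dimensions
print(arr.itemsize)  # 4 bytes per element
print(arr.strides)   # (12, 4): 12 bytes to the next row, 4 to the next column
```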




### How to create a NumPy array?


### From sequence-like data
The easiest way to create an array is to call the `array` function on sequence-like data such as a list.


In [1]:
import numpy as np
lst = [[2, 3, 5]]
arr_1 = np.array(lst)
print(arr_1.dtype)

int64


Optionally, data types can be specified while creating an array.


In [2]:
lst = [2.0,3,5]
# The float 2.0 is converted to a 16-bit integer
arr_1 = np.array(lst, dtype=np.int16)


A NumPy array is statically typed, i.e. its type is determined, either implicitly or explicitly, when the array is created.
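
One consequence of static typing is that values assigned later are converted to the array's existing type. The sketch below uses made-up values to illustrate this, and shows `astype()` as one way to get an array of a different type.

```python
import numpy as np

arr_int = np.array([1, 2, 3], dtype=np.int16)

# Assigning a float into an integer array truncates it; the dtype stays int16
arr_int[0] = 7.9
print(arr_int, arr_int.dtype)

# astype() returns a new array with the requested type
arr_float = arr_int.astype(np.float64)
arr_float[0] = 7.9
print(arr_float, arr_float.dtype)
```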

A nested list can be passed in order to create a multidimensional array.

In [3]:
lst = [[2,3,5],[4,5,6]]
arr_2 = np.array(lst, dtype = np.float32)


In [4]:
# Use the dtype attribute to find the data type of each array.
print(arr_1.dtype, arr_2.dtype)

int16 float32


In [5]:
# Use the shape attribute to determine the shape of the array.
rows, cols = arr_2.shape
print(rows, cols)

2 3


### Using other functions

NumPy also provides many other functions to create arrays:


In [6]:
# Create an array of all zeros
arr_zeros = np.zeros((3,3))   
print(arr_zeros)

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]


In [7]:
# Create an array of all ones
arr_ones = np.ones((3,3))   
print(arr_ones)

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]


In [8]:
# Create an array with random values drawn uniformly from [0, 1)
arr_rand = np.random.random((3,3))
print(arr_rand)

[[0.8841099  0.29462162 0.34423853]
 [0.95709502 0.12864758 0.86532073]
 [0.31898119 0.85862457 0.0020604 ]]


In [9]:
# Create with arange using start, stop (excluded), and step arguments

np.arange(0, 10, 1) 


array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [10]:
# Use linspace with start, stop (included), and number-of-elements arguments
np.linspace(0, 5, 10)

array([0.        , 0.55555556, 1.11111111, 1.66666667, 2.22222222,
       2.77777778, 3.33333333, 3.88888889, 4.44444444, 5.        ])
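
The two functions treat the stop value differently, which is a common source of off-by-one mistakes. A small sketch with arbitrary values, comparing `arange`, `linspace`, and `linspace` with `endpoint=False`:

```python
import numpy as np

# arange: stop is excluded, spacing is given by step
a = np.arange(0, 1, 0.25)          # 0, 0.25, 0.5, 0.75

# linspace: stop is included by default, the count is given by num
b = np.linspace(0, 1, 5)           # 0, 0.25, 0.5, 0.75, 1

# endpoint=False makes linspace exclude the stop value as well
c = np.linspace(0, 1, 4, endpoint=False)

print(a)
print(b)
print(c)
```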

### From text files

In the previous week, we learnt how to read data from a text file into Python. Similarly, text files can be read into NumPy arrays using the `genfromtxt` function.

In [11]:
CSV_FILE = "/home/asimbanskota/t81_577_data_science/weekly_materials/week5/files/city.csv"

In [12]:
from numpy import genfromtxt
# skip_header=1 skips the header row of the file
data = genfromtxt(CSV_FILE, delimiter=',', skip_header=1)
print(data[0:4])

[[  0.  41.  80.  nan  nan]
 [  1.  42.  97.  nan  nan]
 [  2.  46. 120.  nan  nan]
 [  3.  42.  71.  nan  nan]]


Because the fourth and fifth columns of the table contain strings, and `genfromtxt` reads every column as a float by default, their values were imported as NaN (not a number).
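
The same behaviour can be reproduced without a file on disk. The sketch below reads from an in-memory string whose contents are invented for illustration (the column names and values are assumptions, not the actual `city.csv`), and then uses the `usecols` argument as one possible way to load only the numeric columns.

```python
from io import StringIO
import numpy as np

# Made-up stand-in for a CSV file with three numeric and two text columns
csv_text = """id,temp,aqi,city,state
0,41,80,Springfield,IL
1,42,97,Shelbyville,IL
2,46,120,Ogdenville,IL
"""

# Default behaviour: every column is parsed as float, so text becomes nan
all_cols = np.genfromtxt(StringIO(csv_text), delimiter=",", skip_header=1)
print(all_cols)

# usecols keeps only the numeric columns and avoids the nan columns
numeric = np.genfromtxt(StringIO(csv_text), delimiter=",", skip_header=1,
                        usecols=(0, 1, 2))
print(numeric)
```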



The `savetxt` function can be used to write a NumPy array to a file.

In [13]:
from numpy import savetxt
CSV_WRITE_FILE = "/home/asimbanskota/t81_577_data_science/weekly_materials/week6/files/city_write.csv"
# Without delimiter=',' the values would be separated by spaces
savetxt(CSV_WRITE_FILE, data, delimiter=',')
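
To check that a written file can be read back, a round trip can be sketched as follows. The array values, the header names, and the temporary file name are all hypothetical; a temporary directory is used so that no real data is overwritten.

```python
import os
import tempfile
import numpy as np

arr = np.array([[0, 41, 80], [1, 42, 97]], dtype=float)

# Write to a temporary file (hypothetical name)
tmp_dir = tempfile.mkdtemp()
out_path = os.path.join(tmp_dir, "example_write.csv")

# delimiter="," gives comma-separated output; fmt controls the number format;
# comments="" stops savetxt from prefixing the header line with "# "
np.savetxt(out_path, arr, delimiter=",", fmt="%.1f",
           header="id,temp,aqi", comments="")

# Read the file back, skipping the header row
restored = np.genfromtxt(out_path, delimiter=",", skip_header=1)
print(restored)
```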

### Indexing Arrays
Individual elements of an array can be accessed, and sub-arrays sliced, by putting indices inside square brackets.

In [14]:
# Access the value in the 4th row and 3rd column of the data array (indices start at 0)
data[3,2]

71.0

In [15]:
# Access all columns, but only the third to fifth rows

In [16]:
data_sub = data[2:5]
print(data_sub)

[[  2.  46. 120.  nan  nan]
 [  3.  42.  71.  nan  nan]
 [  4.  43.  89.  nan  nan]]


**Important**: Slicing an array does not create a new array. It returns a view, a reference that points to the selected portion of the original array's memory. To check whether that is the case, use the `base` attribute of the array.

In [17]:
data_sub.base is data

True

If a value in the sliced array is changed, the original array is altered as well.

In [18]:
import numpy as np
data_sub[0,1] = np.nan
print(data[2:5])

[[  2.  nan 120.  nan  nan]
 [  3.  42.  71.  nan  nan]
 [  4.  43.  89.  nan  nan]]


To create a separate array, use either the `copy()` method or fancy indexing.

In [19]:
data_sub = data[2:5].copy()
data_sub.base is data

False
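
The `base` check works when the original array owns its memory. A more general test is `np.shares_memory`, which reports whether two arrays overlap in memory. The sketch below uses a small made-up array to contrast a view with a copy.

```python
import numpy as np

original = np.arange(12).reshape(3, 4)

view = original[1:3]            # basic slice: a view on the same memory
copy = original[1:3].copy()     # explicit copy: independent data

print(np.shares_memory(original, view))   # True
print(np.shares_memory(original, copy))   # False

view[0, 0] = -1                 # also changes original[1, 0]
copy[0, 1] = -2                 # original is untouched
print(original)
```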

### Fancy Indexing
Fancy indexing is like the simple indexing we've already seen, but we pass arrays or lists of indices in place of single scalars. Note that the `base` attribute of the new array is no longer the original array, which means that altering the values of the subset array does not alter the original array.

In [20]:
row_ind = [2,3,4]
col_ind = [0,1,2]
# Index lists are paired element by element: this selects data[2,0], data[3,1], and data[4,2]
data_sub = data[row_ind,col_ind]
data_sub.base is data

False

In [21]:
# Combine a row slice with a list-like of column indices
data_sub = data[2:5, range(0,5)]
data_sub.base is data

False
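
When two index lists are passed together, NumPy pairs them up element by element rather than selecting a block of rows and columns. The sketch below, on a made-up array, contrasts this pairing with `np.ix_`, which is one way to select the full row-by-column block instead.

```python
import numpy as np

# Made-up 6 x 5 array with values 0..29 so that positions are easy to read
grid = np.arange(30).reshape(6, 5)

rows = [2, 3, 4]
cols = [0, 1, 2]

# Paired indexing: picks grid[2, 0], grid[3, 1], grid[4, 2]
paired = grid[rows, cols]
print(paired)                   # 1-D array with 3 elements

# np.ix_ builds an open mesh, selecting every row/column combination
block = grid[np.ix_(rows, cols)]
print(block)                    # 3 x 3 sub-array

# Fancy indexing returns a copy, so edits do not reach the original
block[0, 0] = -1
print(grid[2, 0])               # still 10
```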

One common use of fancy indexing in data science is to split data randomly into training and validation datasets.

In [22]:
row, col = data.shape
# Hold out 20 percent of the rows as validation data

val_row_indices = np.random.choice(range(row), int(row * 0.2), replace=False).tolist()
# The remaining rows, in sorted order, form the training set
train_row_indices = sorted(set(range(row)) - set(val_row_indices))
val_data = data[val_row_indices, :]
train_data = data[train_row_indices, :]

In [23]:
print(train_data.shape)
print(val_data.shape)
print(data.shape)

(103, 5)
(25, 5)
(128, 5)
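
An alternative way to make such a split is to shuffle the row indices once and cut the shuffled list in two. This is a sketch under stated assumptions, not the approach used above: the stand-in array, the `_demo` names, and the seed value are all hypothetical, and the seed is fixed only so that the split is reproducible.

```python
import numpy as np

# Stand-in for the real data: 128 rows, 5 columns (values are arbitrary)
data_demo = np.arange(128 * 5, dtype=float).reshape(128, 5)

rng = np.random.default_rng(seed=42)     # seed chosen arbitrarily
shuffled = rng.permutation(data_demo.shape[0])

n_val = int(data_demo.shape[0] * 0.2)    # 25 rows for validation
val_idx = shuffled[:n_val]
train_idx = shuffled[n_val:]

# Fancy indexing with the index arrays returns independent copies
val_demo = data_demo[val_idx]
train_demo = data_demo[train_idx]
print(train_demo.shape, val_demo.shape)
```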
