In this mission we will learn:

* How vectorization makes our code faster.
* About n-dimensional arrays, and NumPy's ndarrays.
* How to select specific items, rows, columns, 1D slices, and 2D slices from ndarrays.
* How to use vector math to apply simple calculations to entire ndarrays.
* How to use vectorized methods to perform calculations across either axis of ndarrays.
* How to add extra columns and rows to ndarrays.
* How to sort an ndarray.

# Why Python, NumPy and pandas?

Python is a high-level programming language that lets you do a lot with a few lines of code.
When you write code in Python, you don't have to worry about allocating memory or choosing how the processor carries out each operation. Python takes care of that for you.

The downside is that you have little control over performance, because the interpreter decides how to execute your instructions.

C, on the other hand, is a low-level programming language that lets you write highly efficient applications, but it takes a lot more code.

=> There is a tradeoff between writing your application quickly in Python and making it run faster in C.

=> Fortunately, NumPy and pandas were created to fill this gap and give us the best of both worlds. Both libraries let us write code quickly without sacrificing performance, by using vectorization.

### How Vectorization Makes Code Faster

![image.png](attachment:image.png)

When this code runs, the Python interpreter turns it into bytecode that follows the logic of our for loop. In each iteration of the loop, the bytecode asks the processor to add two numbers together and store the result.
Our computer would therefore take eight processor cycles to process the eight rows of our data.


![image.png](attachment:image.png)

**Vectorization** takes advantage of a processor feature called **Single Instruction Multiple Data (SIMD)** to process data faster. Most modern computer processors support SIMD. SIMD allows a processor to perform the same operation, on multiple data points, in a single processor cycle.

![image.png](attachment:image.png)

Depending on the capabilities of the processor and the size of each data point, vectorized operations can process many data points per processor cycle instead of one.
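To make the comparison concrete, here is a minimal sketch that times a plain Python loop against the equivalent NumPy operation. The sample data and the helper names `loop_sum` and `vectorized_sum` are made up for illustration, and the measured times will vary from machine to machine.

```python
import timeit

import numpy as np

# Hypothetical sample data: 100,000 pairs of small integers.
rng = np.random.default_rng(0)
pairs = rng.integers(0, 10, size=(100_000, 2))
pairs_list = pairs.tolist()


def loop_sum(rows):
    """Add the two values in each row, one row at a time."""
    return [row[0] + row[1] for row in rows]


def vectorized_sum(arr):
    """Add the two columns with a single NumPy operation."""
    return arr[:, 0] + arr[:, 1]


# Both approaches produce the same numbers.
assert loop_sum(pairs_list) == vectorized_sum(pairs).tolist()

# Time three runs of each approach.
loop_time = timeit.timeit(lambda: loop_sum(pairs_list), number=3)
vec_time = timeit.timeit(lambda: vectorized_sum(pairs), number=3)
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")
```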

### import numpy

In [3]:
import numpy as np

# 2. Understanding NumPy ndarrays

In [14]:
l=[[1,2,3],[4,5,6],[7,8,9],[11,12,13]]
# convert a list into a NumPy n-dimensional array(ndarray)
np_arr=np.array(l)
np_arr

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [11, 12, 13]])

N-dimensional refers to the fact that ndarrays can have one or more dimensions. Let's look at some visualizations of one, two, and three dimensional arrays and their common names:
![image.png](attachment:image.png)

In [15]:
np_arr.shape

(4, 3)

The output of the ndarray.shape attribute gives us a few important pieces of information:

* There are two numbers, which tells us that our ndarray is two-dimensional.
* The first number tells us that the first dimension is 4 items long, or put another way that there are 4 rows in our array.
* The second number tells us that the second dimension is 3 items long, or put another way that there are 3 columns in our array.
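As a small illustration of how shape relates to the number of dimensions, the sketch below builds one-, two-, and three-dimensional arrays and prints their `shape` and `ndim` attributes. The array names are chosen only for this example.

```python
import numpy as np

one_d = np.array([1, 2, 3])                # a 1D array (vector)
two_d = np.array([[1, 2, 3], [4, 5, 6]])   # a 2D array (rows and columns)
three_d = np.zeros((2, 3, 4))              # a 3D array filled with zeros

for name, arr in [("1D", one_d), ("2D", two_d), ("3D", three_d)]:
    # len(arr.shape) is always equal to arr.ndim
    print(name, "shape:", arr.shape, "ndim:", arr.ndim)
```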

# 3. Selecting and Slicing Rows and Items from ndarrays

Slicing in NumPy is very similar to slicing in Python, except for how a single value is selected from an ndarray.

![image.png](attachment:image.png)

In [20]:
l=[[1,2,3],[4,5,6],[7,8,9],[11,12,13]]
# select a value from a list of lists in python
l[0][0]

1

In [21]:
# select a value from ndarray
np_arr=np.array(l)
np_arr[0,0]

1

##### select the first two rows:

In [31]:
# ndarray[row,column]
np_arr[:2,:]
# this selects the first and second rows (index 0 and 1).
# NB: the row at index 2 is not included

array([[1, 2, 3],
       [4, 5, 6]])

Each of the row and column locations can be one of the following:

* **An integer**, indicating a specific location, eg ndarray[3,0].
* **A slice**, indicating a range of locations, eg ndarray[0:5,6:].
* **A colon**, indicating every location, eg ndarray[:,2].
* **A list of values**, indicating specific locations, eg ndarray[[0,1,3,4],0].
* **A boolean array**, indicating specific locations.
* Or any combination of the above.
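The selection types listed above can be tried on a small array. This is a minimal sketch using the same values as `np_arr`; the variable names are chosen only for this example.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9],
                [11, 12, 13]])

single_item = arr[3, 0]           # integers: row 3, column 0 -> 11
row_slice = arr[0:2, 1:]          # slices: rows 0-1, columns 1 onward
one_column = arr[:, 2]            # colon: every row of column 2
picked_rows = arr[[0, 1, 3], 0]   # list: rows 0, 1 and 3 of column 0

# boolean array: keep only the rows marked True (rows 0 and 3)
mask = np.array([True, False, False, True])
masked_rows = arr[mask, :]

print(single_item)
print(row_slice)
print(one_column)
print(picked_rows)
print(masked_rows)
```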

##### select the third column

In [24]:
np_arr[:,2]

array([ 3,  6,  9, 13])

![image.png](attachment:image.png)

##### select multiple columns

In [28]:
np_arr[:,[0,2]]

array([[ 1,  3],
       [ 4,  6],
       [ 7,  9],
       [11, 13]])

# 4. NumPy vector operations and arithmetic functions:

In [4]:
# add the first and the second columns and store the result in a new array:
my_numbers = [
              [6, 5],
              [1, 3],
              [5, 6],
              [1, 4],
              [3, 7],
              [5, 8],
              [3, 5],
              [8, 4]
             ]
# python way
sums = []

for row in my_numbers:
    row_sum = row[0] + row[1]
    sums.append(row_sum)
print("python way", sums)
# convert the list of lists to an ndarray
my_numbers = np.array(my_numbers)

np_sums=my_numbers[:,0]+my_numbers[:,1]
print("numpy way",np_sums)

python way [11, 4, 11, 5, 10, 13, 8, 12]
numpy way [11  4 11  5 10 13  8 12]


![image.png](attachment:image.png)

We can use any of the standard Python numeric operators to perform vector math:

* vector_a + vector_b - Addition
* vector_a - vector_b - Subtraction
* vector_a * vector_b - Multiplication (this is unrelated to the vector multiplication used in linear algebra).
* vector_a / vector_b - Division
* vector_a % vector_b - Modulus (find the remainder when vector_a is divided by vector_b)
* vector_a ** vector_b - Exponent (raise vector_a to the power of vector_b)
* vector_a // vector_b - Floor Division (divide vector_a by vector_b, rounding down to the nearest integer)

Alternatively, we can use NumPy's arithmetic functions, such as np.divide().
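The following sketch applies each operator to two small example vectors and checks that a NumPy function gives the same result as the matching operator. The values of `vector_a` and `vector_b` are made up for illustration.

```python
import numpy as np

vector_a = np.array([7, 8, 9])
vector_b = np.array([2, 3, 4])

added = vector_a + vector_b        # element-wise sums: 9, 11, 13
subtracted = vector_a - vector_b   # element-wise differences: 5, 5, 5
multiplied = vector_a * vector_b   # element-wise products: 14, 24, 36
divided = vector_a / vector_b      # true division, gives floats
remainder = vector_a % vector_b    # remainders: 1, 2, 1
power = vector_a ** vector_b       # powers: 49, 512, 6561
floored = vector_a // vector_b     # floor division: 3, 2, 2

# The function form is equivalent to the operator form.
same_as_operator = np.array_equal(np.divide(vector_a, vector_b), divided)
print(divided, same_as_operator)
```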

# 5. Calculating Statistics For 1D ndarrays

NumPy ndarrays have methods for many different calculations. A few key methods are:

* ndarray.min() to calculate the minimum value
* ndarray.max() to calculate the maximum value
* ndarray.mean() to calculate the mean (average) value
* ndarray.sum() to calculate the sum of the values

In [10]:
arr=np.arange(10)
print(arr.sum())# use the object method
print(np.max(arr)) # use numpy function

45
9


# Calculating Statistics For 2D ndarrays

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [21]:
arr_2d=np.random.random((3,2))
arr_2d

array([[0.63815744, 0.43571488],
       [0.72250289, 0.96096811],
       [0.53096338, 0.24885084]])

Note: np.random.random() returns different values on every run. The cell above was re-run after the cells below (see the In [ ] numbers), so the sums below come from an earlier random draw and do not add up to the values shown here.

In [18]:
arr_2d.sum()

3.0257971422679315

In [19]:
arr_2d.sum(axis=0)

array([1.77564804, 1.2501491 ])

In [20]:
arr_2d.sum(axis=1)

array([1.44768443, 0.613978  , 0.96413472])
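Because random values make the axis behaviour hard to check by eye, here is a minimal sketch with a fixed array. The array `scores` and its values are made up for illustration.

```python
import numpy as np

scores = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])

total = scores.sum()               # every value: 21
column_sums = scores.sum(axis=0)   # one result per column: 9 and 12
row_sums = scores.sum(axis=1)      # one result per row: 3, 7 and 11
column_max = scores.max(axis=0)    # other methods take axis too: 5 and 6

print(total, column_sums, row_sums, column_max)
```

With axis=0 the calculation runs down the rows and returns one value per column; with axis=1 it runs across the columns and returns one value per row.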

# 6. Adding Rows and Columns to ndarrays using numpy.concatenate()

### concatenate rows and columns:

In [28]:
zeros=np.zeros((2,2),dtype=int)
ones=np.ones((2,2),dtype=int)

In [30]:
combined_by_rows=np.concatenate([zeros,ones],axis=0)
combined_by_rows

array([[0, 0],
       [0, 0],
       [1, 1],
       [1, 1]])

In [31]:
combined_by_cols=np.concatenate([zeros,ones],axis=1)
combined_by_cols

array([[0, 0, 1, 1],
       [0, 0, 1, 1]])

### expand dimensions:

In [40]:
zeros=np.zeros(3)
np.expand_dims(zeros,axis=0)

array([[0., 0., 0.]])
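The two steps can be combined to add a 1D array to a 2D array as a new column or row. This is a minimal sketch with made-up values; `table`, `new_column`, and `new_row` are names chosen only for this example.

```python
import numpy as np

table = np.array([[1, 2],
                  [3, 4],
                  [5, 6]])            # shape (3, 2)

# np.concatenate needs arrays with the same number of dimensions,
# so the 1D column of shape (3,) is first expanded to shape (3, 1).
new_column = np.array([10, 20, 30])
new_column_2d = np.expand_dims(new_column, axis=1)
with_column = np.concatenate([table, new_column_2d], axis=1)  # shape (3, 3)

# A 1D row of shape (2,) is expanded to shape (1, 2) instead.
new_row = np.array([7, 8])
new_row_2d = np.expand_dims(new_row, axis=0)
with_row = np.concatenate([table, new_row_2d], axis=0)        # shape (4, 2)

print(with_column)
print(with_row)
```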

# 7. Sorting ndarrays

In [45]:
# numpy.argsort() function returns the indices which would sort an array.
# we can use it to sort a dataset by a particular column
colors=np.array(['red','blue','yellow','brown','black'])
sorted_order = np.argsort(colors)
colors_sorted = colors[sorted_order]
colors_sorted

array(['black', 'blue', 'brown', 'red', 'yellow'], dtype='<U6')
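The same idea extends to sorting a 2D array by one of its columns, as mentioned in the comment above. This is a minimal sketch with made-up data.

```python
import numpy as np

data = np.array([[3, 30],
                 [1, 10],
                 [2, 20]])

# Indices that would sort the first column in ascending order.
order = np.argsort(data[:, 0])

# Using those indices as row positions reorders whole rows.
sorted_data = data[order]

# Reversing the index array gives descending order.
descending = data[order[::-1]]

print(sorted_data)
print(descending)
```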

# Lesson 2: Boolean Indexing with NumPy

In this mission we will learn:

* How to use numpy.genfromtxt() to read in an ndarray.
* About NaN values.
* What a boolean array is, and how to create one.
* How to use boolean indexing to filter values in one and two-dimensional ndarrays.
* How to assign one or more new values to an ndarray based on their locations.
* How to assign one or more new values to an ndarray based on their values.

### import csv file:

In [48]:
taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',')

In [49]:
taxi

array([[      nan,       nan,       nan, ...,       nan,       nan,
              nan],
       [2.016e+03, 1.000e+00, 1.000e+00, ..., 1.165e+01, 6.999e+01,
        1.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, ..., 8.000e+00, 5.430e+01,
        1.000e+00],
       ...,
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 5.000e+00, 6.334e+01,
        1.000e+00],
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 8.950e+00, 4.475e+01,
        1.000e+00],
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 0.000e+00, 5.484e+01,
        2.000e+00]])

### NaN?

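**NaN** is an acronym for **Not a Number**. The concept can seem unusual at first: it means that the value cannot be stored as a number. Here, the first row is NaN because the header row of the CSV file contains text, which cannot be converted to a float. NaN is similar to Python's None in that it marks a missing value.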
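A few properties of NaN are worth seeing in code. The sketch below uses a small made-up array; `np.isnan()` and `np.nanmean()` are standard NumPy functions.

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])

# NaN is never equal to anything, not even to itself,
# so np.isnan() is used to find it instead of ==.
nan_equals_itself = np.nan == np.nan   # False
nan_mask = np.isnan(values)            # True only at position 1

# A single NaN makes ordinary statistics return NaN.
plain_mean = values.mean()

# np.nanmean() ignores NaN values: (1.0 + 3.0) / 2
mean_ignoring_nan = np.nanmean(values)

print(nan_equals_itself, nan_mask, plain_mean, mean_ignoring_nan)
```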

### skip the header:

In [50]:
taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',',skip_header=1)
taxi

array([[2.016e+03, 1.000e+00, 1.000e+00, ..., 1.165e+01, 6.999e+01,
        1.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, ..., 8.000e+00, 5.430e+01,
        1.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, ..., 0.000e+00, 3.780e+01,
        2.000e+00],
       ...,
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 5.000e+00, 6.334e+01,
        1.000e+00],
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 8.950e+00, 4.475e+01,
        1.000e+00],
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 0.000e+00, 5.484e+01,
        2.000e+00]])

In [51]:
taxi[0,:]

array([2.016e+03, 1.000e+00, 1.000e+00, 5.000e+00, 0.000e+00, 2.000e+00,
       4.000e+00, 2.100e+01, 2.037e+03, 5.200e+01, 8.000e-01, 5.540e+00,
       1.165e+01, 6.999e+01, 1.000e+00])
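The effect of skip_header can be reproduced without the taxi file. This sketch feeds a tiny in-memory CSV to `np.genfromtxt()` through `io.StringIO`; the column names and values are hypothetical and are not taken from nyc_taxis.csv.

```python
import io

import numpy as np

# Hypothetical two-column CSV with a text header row.
csv_text = "pickup_month,fare_amount\n1,52.0\n6,45.0\n"

# Without skip_header, the header text cannot be parsed as floats,
# so the first row of the result is filled with NaN.
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

# With skip_header=1, the first line of the file is ignored.
without_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                               skip_header=1)

print(with_header)
print(without_header)
```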

### select data using boolean arrays:

In [53]:
print(np.array([2,4,6,8]) < 5)

[ True  True False False]


![image.png](attachment:image.png)

#### How to select specific rows using boolean array:

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [58]:
taxi[taxi[:,1]==1]

array([[2.016e+03, 1.000e+00, 1.000e+00, ..., 1.165e+01, 6.999e+01,
        1.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, ..., 8.000e+00, 5.430e+01,
        1.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, ..., 0.000e+00, 3.780e+01,
        2.000e+00],
       ...,
       [2.016e+03, 1.000e+00, 3.100e+01, ..., 0.000e+00, 1.380e+01,
        2.000e+00],
       [2.016e+03, 1.000e+00, 3.100e+01, ..., 0.000e+00, 2.180e+01,
        2.000e+00],
       [2.016e+03, 1.000e+00, 3.100e+01, ..., 0.000e+00, 2.430e+01,
        2.000e+00]])
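The same row filter can be followed step by step on a small array. This is a minimal sketch with made-up data, where the first column plays the role of the month and the second is a fare; the names are chosen only for this example.

```python
import numpy as np

trips = np.array([[1, 5.0],
                  [2, 7.5],
                  [1, 3.0],
                  [6, 9.0]])

# Step 1: compare one column with a value to get a boolean array,
# with one True/False entry per row.
january_mask = trips[:, 0] == 1

# Step 2: use the boolean array as the row index to keep matching rows.
january_trips = trips[january_mask]

# The boolean array can be combined with a column index as well.
january_fares = trips[january_mask, 1]

print(january_mask)
print(january_trips)
print(january_fares)
```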

![image.png](attachment:image.png)

### Assigning Values in ndarrays

In [59]:
# this creates a copy of our taxi ndarray
taxi_modified = taxi.copy()
taxi_modified[28214, 5] = 1
taxi_modified[:,0] = 16
taxi_modified[[1800,1801], 7] = taxi_modified[:, 7].mean()
taxi_modified

array([[16.  ,  1.  ,  1.  , ..., 11.65, 69.99,  1.  ],
       [16.  ,  1.  ,  1.  , ...,  8.  , 54.3 ,  1.  ],
       [16.  ,  1.  ,  1.  , ...,  0.  , 37.8 ,  2.  ],
       ...,
       [16.  ,  6.  , 30.  , ...,  5.  , 63.34,  1.  ],
       [16.  ,  6.  , 30.  , ...,  8.95, 44.75,  1.  ],
       [16.  ,  6.  , 30.  , ...,  0.  , 54.84,  2.  ]])

In [60]:
a = np.array([1, 2, 3, 4, 5])
a[a > 2] = 9

In [61]:
a

array([1, 2, 9, 9, 9])

![image.png](attachment:image.png)

In [66]:
# Assignment Using Boolean Arrays
c = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

c[c[:,1] > 2, 1] = 99
# general pattern (any comparison operator can be used in place of ==):
# array[array[:, column_for_comparison] == value_for_comparison, column_for_assignment] = new_value
print(c)

[[ 1  2  3]
 [ 4 99  6]
 [ 7 99  9]]


![image.png](attachment:image.png)