# NumPy 

NumPy is a library for the Python programming language that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

**NumPy** is short for Numerical Python.
<hr>

In data science we will want to carry out operations on entire collections of values.

NumPy carries out these operations in compiled code rather than in Python loops, which makes them very fast.

The example below uses Python lists to show that <span style="color:red;">Python lists don't support element-wise calculations</span>. In plain Python we could work around this by looping over the lists and computing the result for each index separately, but this is inefficient and tedious to write.

In [1]:
python_list_one = [0,1,2,3,4,5]
python_list_two = [1,2,3,4,5,6]

<img src="Images/multiplying_python_lists_error.jpg"></img>

A solution that allows element-wise calculations is to use NumPy.
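
For comparison, the error and the plain-Python workaround described above can be sketched as follows. The variable name ```products``` is an illustrative assumption and not part of the original notebook.

```python
python_list_one = [0, 1, 2, 3, 4, 5]
python_list_two = [1, 2, 3, 4, 5, 6]

# Multiplying two lists is not defined in Python and raises a TypeError.
try:
    python_list_one * python_list_two
except TypeError as error:
    print("TypeError:", error)

# Workaround in plain Python: multiply the elements pair by pair.
products = [a * b for a, b in zip(python_list_one, python_list_two)]
print(products)  # [0, 2, 6, 12, 20, 30]
```

NumPy removes the need for the explicit loop.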

<hr></hr> 

## Importing numpy

Importing NumPy lets our script use a data type called a NumPy array.

A NumPy array is similar to a Python list, except that it has an additional feature: it allows performing calculations over entire arrays.

In [2]:
import numpy as np

<hr>

## Using Numpy Arrays

<span style="color:red;">Note: calculations are performed element wise</span>

Suppose we have two numpy arrays called ```np_height``` and ```np_weight```.

NumPy allows us to calculate the BMI (body mass index), a measure of body fat based on height and weight that applies to adult men and women.

The BMI formula is ```weight / height ** 2```

In the example below we can see that calculations can be performed element-wise. 

The first element of the resulting BMI array is computed by dividing the first element of ```np_weight``` by the square of the first element of ```np_height```. The second element is computed the same way from the second elements of ```np_weight``` and ```np_height```.
And so on...
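
A minimal sketch of this calculation, assuming heights in metres and weights in kilograms. The sample values are the same ones used later in this notebook. The name ```bmi``` is an assumption for illustration.

```python
import numpy as np

# Illustrative heights in metres and weights in kilograms.
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])

# Each weight is divided by the square of the matching height.
bmi = np_weight / np_height ** 2
print(bmi)

# The first result matches the calculation done by hand on the first elements.
print(np.isclose(bmi[0], 65.4 / 1.73 ** 2))
```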

<img src="Images/Elementwise_operations.jpg"></img>

<hr>

We can perform many different operations on numpy arrays. Take for example the code below:

In [3]:
np_list_one = np.array(python_list_one)
np_list_two = np.array(python_list_two)
np_product_array = np_list_one * np_list_two
print(np_product_array)

[ 0  2  6 12 20 30]


Essentially, NumPy lets us work with entire arrays as if they were single values.

### Note: 
A NumPy array can hold only one data type. If you build a NumPy array from a list that mixes data types, the values are converted to a common type: for example, mixing integers and floats gives floats, and if any element is a string, all the values become strings.
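
A short sketch of how mixed lists end up with a single type. Integers mixed with floats become floats, and a list that contains a string becomes an array of strings. The variable names are assumptions for illustration.

```python
import numpy as np

# Integers are converted to floats so that all elements share one type.
ints_and_floats = np.array([1, 2, 3.5])

# A single string forces every element to become a string.
numbers_and_text = np.array([1, 2.5, "three"])

print(ints_and_floats, ints_and_floats.dtype)
print(numbers_and_text, numbers_and_text.dtype)
```

The ```dtype``` attribute shows which type NumPy chose for the array.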

<hr>

## Differences Between Python Lists and Numpy Arrays

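If you use the addition operator on Python lists, the lists are pasted together (concatenated).

If you use the addition operator on NumPy arrays, NumPy does an element-wise sum of the arrays. Adding a Python list to a NumPy array also gives an element-wise sum, because the list is converted to an array first.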

In [4]:
numpy_array_one = np.array(python_list_one)

python_list_addition = python_list_one + python_list_one
print(python_list_addition)

numpy_array_addition = numpy_array_one + numpy_array_one
print(numpy_array_addition)

result_mixing_types_addition = python_list_one + numpy_array_one
print(result_mixing_types_addition)

[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
[ 0  2  4  6  8 10]
[ 0  2  4  6  8 10]


## Similarities Between Python Lists and Numpy Arrays

We can work with NumPy arrays in similar ways to Python lists, for example by indexing with square brackets.

In [5]:
print(numpy_array_one[0])

0


## Indexing Numpy Arrays

NumPy also offers another way of subsetting: using an array of booleans.

Suppose you want to get all the values in the numpy array that are even.

In [6]:
numpy_array_one % 2 == 0

array([ True, False,  True, False,  True, False])

The result is a NumPy array of booleans: True if the corresponding value is even and False otherwise.

You can use this boolean array inside square brackets to do the subsetting.

In [7]:
numpy_array_even = numpy_array_one[numpy_array_one % 2 == 0]
print(numpy_array_even)

[0 2 4]


### Note: 
<span style="color:red">NumPy arrays cannot contain elements with different types</span>. If you try to build such an array, some of the elements' types are changed to end up with a **homogeneous array**. This is known as **type coercion**.

When **type coercion** happens with boolean values and numbers:

- True will be converted to 1

- False will be converted to 0

In [8]:
python_list_three = [True, 1 , 2, 3]
print(np.array(python_list_three))

[1 1 2 3]


In [9]:
print([True, 1, 2] + [3, 4, False])
print(np.array([True, 1, 2]) + np.array([3, 4, False]))


[True, 1, 2, 3, 4, False]
[4 5 2]


<hr>

## Numpy Array Type 

A NumPy array has the type ```numpy.ndarray```.

- The ```numpy.``` part of the type tells you it's a type that was defined in the numpy package.

- The ```ndarray``` part of the type stands for N-dimensional array.

In [10]:
numpy_array = np.array([1,2,3])
type(numpy_array)

numpy.ndarray

<hr>

## Multi-Dimensional Numpy Array

You can think of a 2-dimensional NumPy array as an improved list of lists: you can perform calculations on the arrays and subset them in more advanced ways.

Each sublist in the list corresponds to a row in the two-dimensional NumPy array.

In [11]:
python_2d_list = [[1,2,3],[4,5,6]]
np_2d_array = np.array(python_2d_list)
print(np_2d_array)

[[1 2 3]
 [4 5 6]]


### np.shape()

```shape``` is an attribute of a NumPy array that tells you what the data structure looks like: the number of elements along each dimension. The same information is available from the function ```np.shape()```.

Using either one, we can see the shape of a NumPy array.

In [12]:
print(np_2d_array)
print("np_2d_array shape:", np_2d_array.shape)
print("np_2d_array shape:", np.shape(np_2d_array))

[[1 2 3]
 [4 5 6]]
np_2d_array shape: (2, 3)
np_2d_array shape: (2, 3)


We can see that we have 2 rows and 3 columns.
<hr>

## Multi-Dimensional Numpy Array Rules: 

As with one-dimensional arrays, 2D NumPy arrays can only contain a single type.

If one of the elements is a string while the rest are floats or ints, all of the elements will be **coerced** to strings to end up with a **homogeneous** array.
<hr>

## Subsetting A Multi-Dimensional Numpy Array

Suppose you want the first row and then the third element in that row. 

To select a row we use square brackets after referencing the variable name.

In [13]:
print(np_2d_array)
print(np_2d_array[0]) # select a row 

[[1 2 3]
 [4 5 6]]
[1 2 3]


To select the first row and third element you can extend the same call with another pair of brackets. 
You select the row and then from that row you can do another selection.

In [14]:
print(np_2d_array)
np_2d_array[0][2]

[[1 2 3]
 [4 5 6]]


3

There is also another way of subsetting: a single pair of square brackets with the indices separated by a comma.

In [15]:
print(np_2d_array)
print(np_2d_array[0,2])

[[1 2 3]
 [4 5 6]]
3


The value before the comma specifies the row, the value after the comma specifies the column.

The intersection of the rows and columns you specified is then returned.

<img src="Images/numpy_2d_array_selection.jpg"></img>

This way of indexing numpy arrays is more intuitive and opens up more possibilities. 

Suppose you have the following lists of heights and weights:

In [16]:
height = [1.73, 1.68, 1.71, 1.89, 1.79]
np_height = np.array(height)

weight = [65.4, 59.2, 63.6, 88.4, 68.7]
np_weight = np.array(weight)

np_2d_weights_and_heights = np.array([np_height, np_weight])
print(np_2d_weights_and_heights)

[[ 1.73  1.68  1.71  1.89  1.79]
 [65.4  59.2  63.6  88.4  68.7 ]]


Each column holds the data for one person: the height in the first row and the weight in the second.

Suppose we want to select the height and weight of the second and third family members.

You want both rows, so you put a colon before the comma. Since you only want the second and third columns, you put the slice 1:3 after the comma. Remember that the end index, 3, is not included.

In [17]:
print(np_2d_weights_and_heights[:,1:3])

[[ 1.68  1.71]
 [59.2  63.6 ]]


<img src="Images/numpy_2d_array_selection_2.jpg"></img>

The intersection gives us a 2d numpy array with 2 rows and 2 columns.

Similarly, you can select the weights of all family members using the following indexing.

In [18]:
print(np_2d_weights_and_heights[1,:])

[65.4 59.2 63.6 88.4 68.7]


You only want the second row so put a 1 before the comma, and since you want all columns you use a colon after the comma. 

<img src="Images/numpy_2d_array_selection_3.jpg"></img>

The intersection gives us the entire second row.

In [19]:
print(np_2d_weights_and_heights)
print("np_2d_weights_and_heights.shape: ", np_2d_weights_and_heights.shape) # Getting the shape calling the attribute shape
print("np_2d_weights_and_heights.shape: ", np.shape(np_2d_weights_and_heights)) # Getting the shape using the function 

print(np_2d_weights_and_heights[0,2])
print(np_2d_weights_and_heights[:,1:3])
print(np_2d_weights_and_heights[1,:])

[[ 1.73  1.68  1.71  1.89  1.79]
 [65.4  59.2  63.6  88.4  68.7 ]]
np_2d_weights_and_heights.shape:  (2, 5)
np_2d_weights_and_heights.shape:  (2, 5)
1.71
[[ 1.68  1.71]
 [59.2  63.6 ]]
[65.4 59.2 63.6 88.4 68.7]


<hr>

## Multi-Dimensional Numpy Array Element-Wise Calculations 

Calculations on multi-dimensional NumPy arrays are also performed element-wise.

In [20]:
print(np_2d_array)
print(np_2d_array + np_2d_array)
print(np_2d_array - np_2d_array)
print(np_2d_array * np_2d_array)
print(np_2d_array / np_2d_array)

print(np_2d_array[:,0] + np_2d_array[:,1] ** 2)


[[1 2 3]
 [4 5 6]]
[[ 2  4  6]
 [ 8 10 12]]
[[0 0 0]
 [0 0 0]]
[[ 1  4  9]
 [16 25 36]]
[[1. 1. 1.]
 [1. 1. 1.]]
[ 5 29]


**Sum**
#### Note: 
On NumPy arrays, np.sum() is typically much faster than Python's built-in ```sum()``` function.

In [21]:
sum_col_0 = np.sum(np_2d_array[:,0])
print(sum_col_0)

5
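
A rough way to check the speed claim on your own machine, sketched with the standard ```timeit``` module. The array size and the number of repetitions are arbitrary assumptions, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

values = np.arange(100_000, dtype=np.int64)

# Both calls return the same total.
python_total = sum(values)
numpy_total = np.sum(values)
print(python_total == numpy_total)

# Time 10 repetitions of each call; exact numbers depend on the machine.
python_seconds = timeit.timeit(lambda: sum(values), number=10)
numpy_seconds = timeit.timeit(lambda: np.sum(values), number=10)
print(f"sum():    {python_seconds:.4f} s")
print(f"np.sum(): {numpy_seconds:.4f} s")
```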


**sort**

On NumPy arrays, np.sort() is typically faster than Python's built-in sorted() function. It returns a sorted copy and leaves the original array unchanged.

In [22]:
sorted_col_0 = np.sort(np_2d_array[:,0])
print(sorted_col_0)
print(sorted_col_0.shape)

[1 4]
(2,)


## Basic Statistics With Numpy

**Mean** - average value 

In [23]:
mean_of_col0 = np.mean(np_2d_array[:,0])
print(mean_of_col0)

2.5


**Median** - middle value of a sorted list (the average of the two middle values when the number of values is even)

In [24]:
median_of_col0 = np.median(np_2d_array[:,0])
print(median_of_col0)

2.5


**Standard deviation** - a measure of the amount of variation or dispersion in a set of data values. By default, np.std() computes the population standard deviation.

In [25]:
std_col_0 = np.std(np_2d_array[:,0])
print(std_col_0)

1.5
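
A small sketch of what the default means in practice. ```np.std()``` divides by the number of values N unless the ```ddof``` argument is set; ```ddof=1``` divides by N - 1, which gives the sample standard deviation. The variable names are assumptions for illustration.

```python
import numpy as np

column = np.array([1, 4])

# Default (ddof=0): population standard deviation, divides by N.
population_std = np.std(column)

# ddof=1: sample standard deviation, divides by N - 1.
sample_std = np.std(column, ddof=1)

print(population_std)  # 1.5
print(sample_std)
```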


**np.corrcoef()** - Returns the matrix of Pearson correlation coefficients, which shows whether columns are linearly correlated

In [26]:
correlation = np.corrcoef(np_2d_array[:,0], np_2d_array[:,1])
print(correlation)

[[1. 1.]
 [1. 1.]]
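
The columns above contain only two values each, and any two distinct points lie exactly on a line, so the coefficient there cannot be anything other than 1 or -1. A sketch with more data, reusing the five heights and weights from earlier in this notebook, gives a more informative result.

```python
import numpy as np

np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])

# The result is a 2x2 matrix with ones on the diagonal.
correlation = np.corrcoef(np_height, np_weight)
print(correlation)

# The off-diagonal entry is the correlation between height and weight.
print(correlation[0, 1])
```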


<hr>

## Generating Data

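We can generate random data using ```np.random.normal()```. Its arguments are the mean of the distribution, the standard deviation, and the number of samples to draw.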

In [27]:
np_height = np.round(np.random.normal(1.75, 0.20, 5000), 2)
np_weight = np.round(np.random.normal(60.32, 15, 5000), 2)
np_city = np.column_stack((np_height, np_weight))
print(np_city)

[[ 1.29 63.37]
 [ 1.59 46.37]
 [ 1.89 49.68]
 ...
 [ 1.69 53.04]
 [ 1.94 67.46]
 [ 1.71 48.5 ]]
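
The output above changes on every run because the values are random. A sketch of the same call with named arguments and a fixed seed follows. The seed value 42 is an arbitrary assumption, used only to make the run repeatable.

```python
import numpy as np

np.random.seed(42)  # assumed seed, only to make the run repeatable

# loc = mean, scale = standard deviation, size = number of samples
np_height = np.round(np.random.normal(loc=1.75, scale=0.20, size=5000), 2)

print(np_height.shape)     # (5000,)
print(np.mean(np_height))  # close to 1.75
print(np.std(np_height))   # close to 0.20
```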


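### We Can Paste Two Columns Together Using np.column_stack()

```np.column_stack()``` takes a sequence of 1D arrays and stacks them as the columns of a 2D array, as done for ```np_city``` above.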
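
A minimal sketch on two small arrays makes the result easier to see. The names ```first```, ```second``` and ```stacked``` are assumptions for illustration.

```python
import numpy as np

first = np.array([1, 2, 3])
second = np.array([4, 5, 6])

# Each input array becomes one column of the result.
stacked = np.column_stack((first, second))
print(stacked)
# [[1 4]
#  [2 5]
#  [3 6]]
print(stacked.shape)  # (3, 2)
```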

<hr>

## Average versus median

Sometimes in data sets we will have outliers.

These outliers can drastically affect the average.

In that case, rather than calculating the average, we can calculate the median, which is much less sensitive to extreme values.

It's always a good idea to check both the median and the mean, to get an idea about the overall distribution of the entire dataset.
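
A small sketch of this effect, using made-up heights and a single assumed outlier (17.5 typed instead of 1.75). The values and variable names are illustrative only.

```python
import numpy as np

heights = np.array([1.70, 1.72, 1.75, 1.78, 1.80])

# Same data plus one outlier, e.g. 17.5 entered instead of 1.75.
heights_with_outlier = np.append(heights, 17.5)

# Without the outlier the mean and the median agree.
print(np.mean(heights), np.median(heights))

# With the outlier the mean jumps, while the median barely moves.
print(np.mean(heights_with_outlier), np.median(heights_with_outlier))
```

Here the mean rises from about 1.75 to about 4.38, while the median only moves from 1.75 to about 1.77.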