***
# NumPy
***
In the class on **Python lists** we saw how lists can store several different data types. We also saw how flexible lists are when we changed, added and removed elements. There will be many times in your data science career when you want to perform a calculation on every value in a collection at once, which lists do not support directly. For example, if I have a list of integers, [1,2,3,4], the list data type does not let me multiply every element by 2: writing [1,2,3,4] * 2 simply repeats the list.

In this class we are going to explore the [NumPy package](https://www.numpy.org). Let's kick off by revisiting our list of employees' salaries.

In [1]:
# A list of employees' salaries
salaries = [20000, 25000, 30000, 35000, 40000, 450000, 50000]


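How do we run calculations over this list? Python's built-in sum() can add up a list, but a list cannot do arithmetic on all of its elements at once, and it offers no ready-made statistics such as the median. Instead, we use the NumPy package and its **array()** function. NumPy arrays are an alternative to lists that allow us to perform calculations over an entire array.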

Before going any further you will need to install the NumPy package. We discussed how to do this in the previous class, so go ahead now and use pip to install NumPy.

In [2]:
# Start by importing numpy as np
import numpy as np

# Create a new list of employees' salaries
salaries2019 = [20000, 25000, 30000, 35000, 40000, 450000, 50000]

# Pass salaries2019 to numpy array
np_salaries = np.array(salaries2019)

# Output np_salaries
np_salaries

array([ 20000,  25000,  30000,  35000,  40000, 450000,  50000])

In [3]:
# Use NumPy's sum() to sum the values in np_salaries
total_salaries = np.sum(np_salaries)

# Output total_salaries
total_salaries

650000

In [4]:
# Find the median (middle) salary
median_salary = np.median(np_salaries)

median_salary

35000.0
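
Note that the median is not the same as the arithmetic mean. As a minimal sketch using the same salary figures (the variable names here are illustrative, not part of the class material), the two can be compared side by side. The single 450000 value pulls the mean well above the median:

```python
import numpy as np

# Same figures as salaries2019 above
salaries = np.array([20000, 25000, 30000, 35000, 40000, 450000, 50000])

mean_salary = np.mean(salaries)      # arithmetic mean: total divided by count
median_salary = np.median(salaries)  # middle value once the data is sorted

print(mean_salary)
print(median_salary)  # 35000.0
```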

As you can see, the calculations were performed across the entire array. NumPy is an incredibly powerful tool to have at your disposal. To learn more about NumPy, don't forget that you can always check out the documentation.

In [5]:
help(np.median)

Help on function median in module numpy.lib.function_base:

median(a, axis=None, out=None, overwrite_input=False, keepdims=False)
    Compute the median along the specified axis.
    
    Returns the median of the array elements.
    
    Parameters
    ----------
    a : array_like
        Input array or object that can be converted to an array.
    axis : {int, sequence of int, None}, optional
        Axis or axes along which the medians are computed. The default
        is to compute the median along a flattened version of the array.
        A sequence of axes is supported since version 1.9.0.
    out : ndarray, optional
        Alternative output array in which to place the result. It must
        have the same shape and buffer length as the expected output,
        but the type (of the output) will be cast if necessary.
    overwrite_input : bool, optional
       If True, then allow use of memory of input array `a` for
       calculations. The input array will be modified by the c

The NumPy arrays we have used so far have held only one data type: integers. You might also have noticed how quickly NumPy performs its calculations. That speed comes from an assumption: NumPy assumes that every element of an array has the same type, such as all integers or all floats. In short, a NumPy array can hold only one data type.

In [6]:
# Create a mixed NumPy array
np.array(["tony", 35000, True])

array(['tony', '35000', 'True'], dtype='<U5')

As you can see, NumPy converted every element to a string so that the array has a single data type.
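
To check which type NumPy chose, you can inspect an array's `dtype` attribute. The sketch below uses illustrative variable names; the exact integer type reported (for example `int64` or `int32`) can depend on your platform and NumPy version:

```python
import numpy as np

int_array = np.array([1, 2, 3])                  # integers only
mixed_numbers = np.array([1, 2.5, 3])            # integers and a float
mixed_values = np.array(["tony", 35000, True])   # string, integer and boolean

print(int_array.dtype)          # an integer type
print(mixed_numbers.dtype)      # float64: the integers were converted to floats
print(mixed_values.dtype.kind)  # 'U' means a Unicode string type
```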

In [7]:
# Output type of np_salaries
type(np_salaries)

numpy.ndarray

NumPy arrays are their own data type, which means, as we discussed in the last chapter, that they can have their own methods and behave differently from lists. Let's look at an example:

In [8]:
# A simple list
python_list = [1,2,3]

In [9]:
# NumPy Array
numpy_array = np.array([1,2,3])

In [10]:
# Concatenate the simple list together
python_list + python_list

[1, 2, 3, 1, 2, 3]

In [11]:
# Add the numpy array to itself
numpy_array + numpy_array

array([2, 4, 6])
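
The `+` operator therefore means different things for the two types. As a minimal sketch, the same contrast shows up with `*`, and `np.concatenate` can be used when you really do want to join arrays end to end:

```python
import numpy as np

python_list = [1, 2, 3]
numpy_array = np.array([1, 2, 3])

# On a list, * repeats the list
print(python_list * 2)   # [1, 2, 3, 1, 2, 3]

# On a NumPy array, * multiplies every element
print(numpy_array * 2)   # [2 4 6]

# To join arrays end to end, use np.concatenate
joined = np.concatenate([numpy_array, numpy_array])
print(joined)            # [1 2 3 1 2 3]
```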

As you can see, NumPy performed an element-wise sum of the array rather than joining two copies together. Be careful when working with different data types, because the output is not always what you expect. Apart from these differences, you can work with NumPy arrays in almost the same way that you work with Python lists. For example, we can use the indexing we've already learned to select elements from an array.

In [14]:
# Create a new salaries array
salaries2019 = [20000, 25000, 30000, 35000, 40000, 450000, 50000]

new_salaries = np.array(salaries2019)

new_salaries

# Select an element from the array using its index
new_salaries[1]

25000
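
Negative indexes and slices carry over from lists as well. A short sketch on the same figures (the result variable names are illustrative):

```python
import numpy as np

new_salaries = np.array([20000, 25000, 30000, 35000, 40000, 450000, 50000])

first_salary = new_salaries[0]      # first element
last_salary = new_salaries[-1]      # last element
middle_slice = new_salaries[1:4]    # elements at index 1, 2 and 3

print(first_salary)   # 20000
print(last_salary)    # 50000
print(middle_slice)   # [25000 30000 35000]
```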

With NumPy arrays we can also use comparison operators such as greater than **(>)** and less than **(<)**. Imagine you've just been asked to list the salaries of all employees who earn more than 30,000. How can this be done? Comparing the array with 30000 produces a boolean array with one True or False value per element. Use this boolean array in square brackets to subset the original array: only the elements where the boolean array is True are selected for a new NumPy array.

In [16]:
# Create the big_salaries array
big_salaries = (new_salaries > 30000)

# Output array
print(big_salaries)

# Output all salaries greater than 30000
print(new_salaries[big_salaries])

[False False False  True  True  True  True]
[ 35000  40000 450000  50000]
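
Conditions can also be combined. The sketch below is one way to do it, assuming we want salaries above 30,000 but below 100,000 (the upper limit is chosen only for illustration). Each condition needs its own parentheses, and `&` is used instead of the `and` keyword:

```python
import numpy as np

new_salaries = np.array([20000, 25000, 30000, 35000, 40000, 450000, 50000])

# Combine two conditions element by element with &
mid_mask = (new_salaries > 30000) & (new_salaries < 100000)
mid_salaries = new_salaries[mid_mask]
print(mid_salaries)   # [35000 40000 50000]

# True counts as 1, so summing a boolean array counts the matches
count_over_30000 = np.sum(new_salaries > 30000)
print(count_over_30000)   # 4
```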


## 2D NumPy Arrays

Using NumPy we can also create multi-dimensional arrays. Let's take a look by creating two new NumPy arrays, np_salaries and np_service.

In [17]:
# Create np_salaries
np_salaries = np.array([20000, 25000, 30000, 35000, 40000, 45000, 50000])

In [18]:
# Create np_service
np_service = np.array([1,2,3,4,5,6,7])

In [19]:
# Print out type
print(type(np_salaries))
print(type(np_service))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In the output above, "ndarray" stands for N-dimensional array. We can create arrays of several dimensions, but for now let's just focus on 2-dimensional arrays.

In [20]:
# Pass a list of two lists to np.array (note the outer square brackets)
np_2d_array = np.array([[20000, 25000, 30000, 35000, 40000, 45000, 50000], [1,2,3,4,5,6,7]])

# Print np_2d_array
np_2d_array

array([[20000, 25000, 30000, 35000, 40000, 45000, 50000],
       [    1,     2,     3,     4,     5,     6,     7]])

As you can see, the output is a rectangular data structure. Each sublist of the list corresponds to a row in the 2D NumPy array. We can examine the shape of the array with the following code:

In [21]:
# Shape attribute providing more info on the data structure
np_2d_array.shape

(2, 7)
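
The shape reads as (rows, columns). As a hedged sketch, the same 2D array could also be built from the two existing 1D arrays instead of retyping the values; the name `combined` is an assumption made for this example:

```python
import numpy as np

np_salaries = np.array([20000, 25000, 30000, 35000, 40000, 45000, 50000])
np_service = np.array([1, 2, 3, 4, 5, 6, 7])

# Each 1D array becomes one row of the 2D array
combined = np.array([np_salaries, np_service])

print(combined.shape)  # (2, 7): 2 rows, 7 columns
print(combined.ndim)   # 2: number of dimensions
print(combined.size)   # 14: total number of elements
```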

Just like in the previous examples, we can still perform calculations and use subsetting. Let's say we want to select the entire first row, and then the third element from that row. How? We can use indexing as we have done before.

In [22]:
# Select first row of np_2d_array
np_2d_array[0]

array([20000, 25000, 30000, 35000, 40000, 45000, 50000])

In [23]:
# Select first row and the third element
np_2d_array[0][2]

30000

What we are doing here is first selecting a row and then, from that row, performing another selection. We can obtain the same result by using a single pair of square brackets and a comma.

In [24]:
# Select first row and the third element
np_2d_array[0,2]

30000

In [26]:
# Select all rows and the columns at index 1 and 2
np_2d_array[:, 1:3]

array([[25000, 30000],
       [    2,     3]])
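
In the comma notation, the part before the comma selects rows and the part after it selects columns; a lone `:` means "everything along this axis". A few more selections, sketched with illustrative variable names:

```python
import numpy as np

np_2d_array = np.array([[20000, 25000, 30000, 35000, 40000, 45000, 50000],
                        [1, 2, 3, 4, 5, 6, 7]])

first_column = np_2d_array[:, 0]       # all rows, first column
service_row = np_2d_array[1, :]        # second row, all columns
last_two_columns = np_2d_array[:, -2:] # all rows, last two columns

print(first_column)      # [20000     1]
print(service_row)       # [1 2 3 4 5 6 7]
print(last_two_columns)
```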

In [27]:
# Sum the first row of np_2d_array with NumPy's sum()
total_salaries = np.sum(np_2d_array[0])

# Output total_salaries
total_salaries

245000
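
Rather than summing one row at a time, the `axis` argument lets NumPy total every row or every column in one call. A minimal sketch, with result names chosen for illustration:

```python
import numpy as np

np_2d_array = np.array([[20000, 25000, 30000, 35000, 40000, 45000, 50000],
                        [1, 2, 3, 4, 5, 6, 7]])

# axis=1 sums across the columns, giving one total per row
row_totals = np_2d_array.sum(axis=1)
print(row_totals)   # [245000     28]

# axis=0 sums down the rows, giving one total per column
column_totals = np_2d_array.sum(axis=0)
print(column_totals)

# Mean of the second row (years of service)
average_service = np_2d_array[1].mean()
print(average_service)   # 4.0
```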