# NumPy
The NumPy package provides large multi-dimensional arrays and matrices, together with operations on those structures. They underpin most data analysis in Python, which makes NumPy essential for analytical work.

Importing NumPy with a shorter alias makes it easier to refer to in code:

In [8]:
import numpy as np

## The NumPy array

A NumPy array can be **created** from an ordinary Python list:

In [11]:
ages = [12, 17, 21, 25, 81]
np_ages = np.array(ages)
print(type(np_ages))  # print the type of the created NumPy array

<class 'numpy.ndarray'>


**Operations** are performed **element-wise** on NumPy arrays: the operation is applied to each member of the array.

In [12]:
print(np_ages + 1)

[13 18 22 26 82]


Trying the same on a list causes an error:

In [13]:
print(ages + 1)

TypeError: can only concatenate list (not "int") to list
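As a minimal sketch of the difference, the snippet below rebuilds the `ages` data and compares element-wise array arithmetic with the list equivalents. The names `in_months`, `summed`, `is_adult`, `ages_plus_one` and `concatenated` are illustrative and do not come from the original notebook.

```python
import numpy as np

ages = [12, 17, 21, 25, 81]
np_ages = np.array(ages)

# A scalar is combined with every element of the array.
in_months = np_ages * 12

# Two arrays of the same length are combined pairwise.
summed = np_ages + np_ages

# Comparisons are element-wise too and produce a boolean array.
is_adult = np_ages >= 18

# With a plain list, the same effect needs an explicit loop or comprehension.
ages_plus_one = [age + 1 for age in ages]

# For lists, `+` means concatenation, not addition.
concatenated = ages + ages

print(in_months)
print(summed)
print(is_adult)
print(ages_plus_one)
print(len(concatenated))
```

The list comprehension gives the same numbers as `np_ages + 1`, but as a list rather than an array.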

All elements in a NumPy array must be of the same type. If they are not, they are **coerced** to one common type:

In [18]:
ages2 = ages.copy()  # copy the list; plain assignment would also modify `ages`
ages2[1] = "18"
np_ages2 = np.array(ages2)
print(ages2)
print(np_ages2)

[12, '18', 21, 25, 81]
['12' '18' '21' '25' '81']


In the example above, a single string among the numbers was enough for all the elements to be coerced to strings when the NumPy array was created.
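The coercion can be inspected through the array's `dtype` attribute and, where the strings really are numbers, reversed with `astype`. This is a small sketch under that assumption; the names `mixed`, `as_int` and `int_and_float` are illustrative only.

```python
import numpy as np

# One string among integers: every element becomes a string.
mixed = np.array([12, "18", 21])
print(mixed.dtype.kind)   # 'U' marks a Unicode string dtype

# astype builds a new array with the requested type.
# This only works here because every string parses as an integer.
as_int = mixed.astype(int)
print(as_int, as_int.dtype.kind)

# Mixing integers and floats coerces to float instead of string.
int_and_float = np.array([12, 17.5])
print(int_and_float.dtype)
```

Checking `dtype` right after creating an array is a cheap way to catch an unintended coercion before doing arithmetic on it.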

With NumPy arrays, basic **subsetting** uses the same slice syntax as Python lists (although an array slice is a view of the original data, not a copy):

In [22]:
print(np_ages)
print(np_ages[:3])

[12 17 21 25 81]
[12 17 21]
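Beyond list-style slices, arrays also accept a boolean mask or a list of indices, and slices behave as views rather than copies. The sketch below illustrates these points on the same data; the variable names are illustrative.

```python
import numpy as np

np_ages = np.array([12, 17, 21, 25, 81])

# Boolean mask: keep only the elements where the condition is True.
adults = np_ages[np_ages >= 18]

# List of indices: pick the first and the last element.
first_and_last = np_ages[[0, -1]]

# An explicit copy is independent of the original array.
independent = np_ages[:3].copy()
independent[0] = 0          # np_ages is not affected

# A plain slice is a view: writing to it changes the original.
view = np_ages[:3]
view[0] = 99                # np_ages[0] is now 99 as well

print(adults)
print(first_and_last)
print(independent)
print(np_ages)
```

Calling `.copy()` on a slice is the usual way to get list-like behaviour when the subset must not alter the original array.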


## The 2D NumPy array

A 2D array can be **created** from a Python list of lists:

In [26]:
ages_and_heights = [[12, 150],
                    [17, 165],
                    [21, 172],
                    [25, 170],
                    [81, 180]]
np_ages_and_heights = np.array(ages_and_heights)
print(np_ages_and_heights)
print(type(np_ages_and_heights))

[[ 12 150]
 [ 17 165]
 [ 21 172]
 [ 25 170]
 [ 81 180]]
<class 'numpy.ndarray'>


Getting **dimension information** about the array (number of rows and number of columns):

In [28]:
print(np_ages_and_heights.shape)

(5, 2)


**Subsetting** a 2D array is similar to subsetting a one-dimensional array. The subset for each dimension is specified, with a comma in between. Rows are specified first, columns last:

In [32]:
print(np_ages_and_heights[1:-1, 1])    # we want the subset to contain all rows except the first and the last 
                                      # and we want it to contain only the height column (index 1)

[165 172 170]
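A few more common row and column selections, as a sketch on the same data. The short names `data`, `second_row`, `ages_column`, `one_value` and `heights_2d` are illustrative.

```python
import numpy as np

data = np.array([[12, 150],
                 [17, 165],
                 [21, 172],
                 [25, 170],
                 [81, 180]])

# A single row index returns that row as a 1D array.
second_row = data[1]

# ":" selects everything along a dimension: all rows, column 0.
ages_column = data[:, 0]

# Two integer indices return a single element (row 2, column 1).
one_value = data[2, 1]

# Slicing the column dimension (1:2) instead of indexing it (1)
# keeps the result two-dimensional.
heights_2d = data[:, 1:2]

print(second_row)
print(ages_column)
print(one_value)
print(heights_2d.shape)
```

Indexing with an integer drops that dimension, while slicing keeps it; this matters when later code expects a 2D shape.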


**Operations** on 2D arrays are, again, applied element-wise. In addition, the other operand can be a 1D array with one value per column (the same length as a row); it is then applied to every row. NumPy calls this broadcasting.

In our example, we could perform a conversion: ages from years to months and heights from centimeters to meters. The conversion factors are 12 and 0.01, and we use an array with these two values as the multiplier.

In [35]:
np_ages_and_heights * np.array([12, 0.01])

array([[144.  ,   1.5 ],
       [204.  ,   1.65],
       [252.  ,   1.72],
       [300.  ,   1.7 ],
       [972.  ,   1.8 ]])

Because one of the factors (0.01) is a real number, the result is a `float` array, even though the original array holds integers. The `dtype` attribute shows this:

In [37]:
(np_ages_and_heights * np.array([12, 0.01])).dtype

dtype('float64')
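The rule behind the multiplication above can be sketched as follows: a 1D operand works when its length matches the number of columns, and fails otherwise unless it is reshaped into a column. The names `data`, `factors`, `row_weights`, `broadcast_failed` and `scaled_rows` are illustrative and not part of the original notebook.

```python
import numpy as np

data = np.array([[12, 150],
                 [17, 165],
                 [21, 172],
                 [25, 170],
                 [81, 180]])

# Shape (5, 2) combined with shape (2,): one factor per column.
factors = np.array([12, 0.01])
converted = data * factors
print(converted.shape, converted.dtype)

# Shape (5, 2) combined with shape (5,): the lengths do not line up
# with the columns, so NumPy raises a ValueError.
row_weights = np.array([1, 2, 3, 4, 5])
broadcast_failed = False
try:
    data * row_weights
except ValueError as err:
    broadcast_failed = True
    print("cannot broadcast:", err)

# Reshaping to a column of shape (5, 1) gives one factor per row.
scaled_rows = data * row_weights.reshape(5, 1)
print(scaled_rows)
```

Reshaping the weights into a column is one way to apply a per-row factor; the original array itself is never modified by these operations.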

# Basic statistics with NumPy

Alongside arrays, NumPy provides functions for processing the data they contain.

Let's start by calculating the mean, median and standard deviation of the ages in our array (column 0).

In [42]:
print(np.mean(np_ages_and_heights[:,0]))
print(np.median(np_ages_and_heights[:,0]))
print(np.std(np_ages_and_heights[:,0]))

31.2
21.0
25.2697447553
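The same statistics can be computed for every column at once with the `axis` argument, as in this sketch. The names `data`, `means`, `medians`, `population_std` and `sample_std` are illustrative.

```python
import numpy as np

data = np.array([[12, 150],
                 [17, 165],
                 [21, 172],
                 [25, 170],
                 [81, 180]])

# axis=0 aggregates down the rows, giving one result per column
# (first the ages, then the heights).
means = np.mean(data, axis=0)
medians = np.median(data, axis=0)

# np.std divides by N by default (population standard deviation).
population_std = np.std(data, axis=0)

# ddof=1 divides by N - 1 instead (sample standard deviation).
sample_std = np.std(data, axis=0, ddof=1)

print(means)
print(medians)
print(population_std)
print(sample_std)
```

Whether `ddof=0` or `ddof=1` is appropriate depends on whether the five people are treated as the whole population or as a sample from a larger one.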


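Now we use the correlation function to see if there is a correlation between ages and heights. The cell must run in a session where NumPy has already been imported as `np`; otherwise it fails with `NameError: name 'np' is not defined`.

In [2]:
print(np.corrcoef(np_ages_and_heights[:,0], np_ages_and_heights[:,1]))

`np.corrcoef` returns a 2×2 matrix of Pearson correlation coefficients. The diagonal entries are 1, and the two off-diagonal entries hold the correlation between ages and heights, which is about 0.74 for this data.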
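To make the result of `np.corrcoef` easier to read, the sketch below extracts the single coefficient from the matrix and recomputes it by hand from the Pearson formula as a cross-check. The names `data`, `corr_matrix`, `r` and `r_manual` are illustrative.

```python
import numpy as np

data = np.array([[12, 150],
                 [17, 165],
                 [21, 172],
                 [25, 170],
                 [81, 180]])
ages = data[:, 0]
heights = data[:, 1]

# corrcoef returns a 2x2 matrix; entry [0, 1] is the correlation
# between the first and the second input.
corr_matrix = np.corrcoef(ages, heights)
r = corr_matrix[0, 1]

# Pearson correlation by hand: covariance of the deviations from the
# mean, divided by the product of their magnitudes.
dx = ages - ages.mean()
dy = heights - heights.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(corr_matrix.shape)
print(round(float(r), 2), round(float(r_manual), 2))
```

With only five data points, one of them far from the rest (age 81), the coefficient is strongly influenced by that single observation and should be read with caution.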