## Creating ndarrays

According to the official NumPy documentation (http://docs.scipy.org/doc/numpy/user/basics.creation.html), there are five ways to create arrays with NumPy.

1. By converting from another Python data type. For example a list:

In [2]:
import numpy as np

data = [3, 4, 5.5, 10]
arr  = np.array(data)
arr

array([ 3. ,  4. ,  5.5, 10. ])

If you have a list made of lists of equal length, the array function will convert it into the corresponding multidimensional array:

In [3]:
data2 = [[1,2,3,4,5],[6,7,8,9,10]]
arr2 = np.array(data2)
arr2

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

In [5]:
arr2.ndim

2

In [6]:
arr2.shape

(2, 5)

2. By creating a new array filled with zeros or ones, or a square NxN identity matrix:

In [7]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [8]:
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [10]:
np.ones((3, 6))

array([[1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1.]])

In [11]:
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [12]:
np.identity(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

3. By using a built-in library function, such as those in the random module (from the previous unit).

4. From files, as we will see later on in this unit.

5. From raw bytes.
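
Methods 3 to 5 are not demonstrated above, so here is a minimal sketch of each. It is an illustration rather than part of the original material: the file name example.txt, the array sizes and the byte values are all assumptions chosen for the demonstration.

```python
import os
import tempfile

import numpy as np

# 3. Using a library function: random integers in the range [0, 10)
random_ints = np.random.randint(10, size=(2, 3))

# 4. From a file: write a small text file, then read it back
#    ("example.txt" is a hypothetical file in a temporary folder)
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("1 2 3\n4 5 6\n")
from_file = np.loadtxt(path)  # values are read as floats by default

# 5. From raw bytes: interpret each byte as an unsigned 8-bit integer
from_bytes = np.frombuffer(b"\x01\x02\x03", dtype=np.uint8)

print(random_ints.shape)    # (2, 3)
print(from_file.tolist())   # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(from_bytes.tolist())  # [1, 2, 3]
```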

## Data types

The most commonly used data types are listed below.

| Data Type | Description                                                                               |
|-----------|-------------------------------------------------------------------------------------------|
| bool_     | True or False (stored as a byte)                                                          |
| int_      | Integer (similar to long in other languages)                                              |
| int32     | Integer between -2147483648 and 2147483647                                                |
| int64     | Integer between -9223372036854775808 and 9223372036854775807                              |
| float_    | Floating point number i.e. a decimal number e.g. 3.45 (can also be float32 or float64)    |
| complex_  | A complex number with real and imaginary components (can also be complex64 or complex128) |

There are other types such as string_ and object. 


To convert an array from one data type to another, use the **astype** method, which creates a new array. The **dtype** attribute shows which type an array holds.

In [14]:
array = np.array([2, 3, 4])
array.dtype

dtype('int64')

In [15]:
array.astype(np.int32)

array([2, 3, 4], dtype=int32)
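
Two further properties of **astype** are worth seeing in action: the original array is left unchanged, and converting floats to integers discards the fractional part rather than rounding. The following sketch uses made-up values (the names floats and numeric_strings are illustrative, not from the text above).

```python
import numpy as np

floats = np.array([1.7, -2.3, 3.9])
ints = floats.astype(np.int32)  # fractional part is discarded, not rounded

print(ints.tolist())    # [1, -2, 3]
print(floats.tolist())  # [1.7, -2.3, 3.9]  (the original is unchanged)

# Strings that represent numbers can be converted as well
numeric_strings = np.array(["1.5", "2", "3.25"])
converted = numeric_strings.astype(np.float64)
print(converted.tolist())  # [1.5, 2.0, 3.25]
```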

## Vectorisation

Vectorisation in programming is the process of applying an operation to an entire array at once, such as multiplying all elements of a matrix by a specific number or by another matrix.

In [17]:
array = np.array([2, 3, 4])
array*5

array([10, 15, 20])

In [19]:
array + 10

array([12, 13, 14])

If you multiply two arrays of the same length, each element of the first array is multiplied by the corresponding element of the second array.

In [20]:
array2 = np.array([2, 2, 1])
array*array2

array([4, 6, 4])
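
To make the contrast with ordinary Python explicit, the sketch below computes the same result twice: once with an explicit loop and once with a vectorised expression. The variable names are illustrative only.

```python
import numpy as np

values = np.array([2, 3, 4])

# Explicit Python loop: one element at a time
looped = np.array([v * 5 + 10 for v in values])

# Vectorised expression: the whole array in one step
vectorised = values * 5 + 10

print(vectorised.tolist())                 # [20, 25, 30]
print(np.array_equal(looped, vectorised))  # True
```

Both versions give the same answer; the vectorised form is shorter and, for large arrays, typically much faster because the loop runs inside NumPy's compiled code.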

## Indexing and slicing 

An ndarray is indexed and sliced much like a Python list, but there are some differences. Consider the following example:

In [22]:
array1 = np.array([1, 2, 3, 4, 5, 6])
array1[2:4]

array([3, 4])

If you now want to replace the values 3 and 4 (the elements at positions 2 and 3) with the number 7, you can simply assign to the slice (this is the first distinction from lists):

In [23]:
array1[2:4]=7
array1

array([1, 2, 7, 7, 5, 6])

Another important difference is that a slice of an array is a view of the original data, not a copy, so changing the slice also changes the original array:

In [24]:
array1_slice = array1[2:4]
array1_slice[1] = 500
array1

array([  1,   2,   7, 500,   5,   6])

This is very important when you handle big data, as slicing does not require creating a new array and populating it. This is one reason why NumPy is commonly used for scientific computing and big data wrangling.
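
The slice in the example above shares its memory with the original array. If you need an independent array instead, you can ask for a copy explicitly. The sketch below, with illustrative variable names, shows the difference.

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5, 6])

view = original[2:4]                # shares memory with original
independent = original[2:4].copy()  # owns its own data

view[0] = 99          # also changes original
independent[1] = -1   # does not change original

print(original.tolist())                        # [1, 2, 99, 4, 5, 6]
print(np.shares_memory(original, view))         # True
print(np.shares_memory(original, independent))  # False
```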

Arrays can also be indexed along more than one axis. You can either treat a 2-dimensional array as an array of arrays and index twice, **array[2][4]**, or pass both indices at once, **array[2, 4]**.
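
As a small sketch of two-dimensional indexing, the example below builds a 3x4 array (the name grid and its contents are assumptions for illustration) and shows that both index styles return the same element, and that slices can be taken along each axis.

```python
import numpy as np

# A 3x4 array holding the numbers 0 to 11, row by row
grid = np.arange(12).reshape(3, 4)

print(grid[1][2])  # 6 (index the row, then the column)
print(grid[1, 2])  # 6 (same element, both indices at once)

# Slicing along both axes: first two rows, columns 1 and 2
print(grid[:2, 1:3].tolist())  # [[1, 2], [5, 6]]

# A whole column: every row, column 0
print(grid[:, 0].tolist())     # [0, 4, 8]
```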

### Boolean Indexing
Suppose we have an array with movies and a corresponding array for their ratings.    

In [26]:
movies = np.array(['The Lord of the Rings', 'The Godfather', 'Harry Potter', 'The Pianist', 'Love Actually', 'Avatar'] )
movies

array(['The Lord of the Rings', 'The Godfather', 'Harry Potter',
       'The Pianist', 'Love Actually', 'Avatar'], dtype='<U21')

In [27]:
ratings = np.random.randint(6, size=(6,4))
ratings

array([[3, 3, 4, 0],
       [0, 4, 5, 3],
       [4, 0, 2, 0],
       [1, 4, 0, 4],
       [4, 0, 1, 2],
       [3, 4, 3, 0]])

Each row corresponds to a movie and each column to a user who rated the movies.

To find which entry of the movies array is "Harry Potter", we compare the array with that string:

In [28]:
movies == 'Harry Potter'

array([False, False,  True, False, False, False])

To retrieve the ratings for this movie, we simply use the boolean array as an index:

In [29]:
ratings[movies == 'Harry Potter']

array([[4, 0, 2, 0]])

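Be careful to use '==' (comparison) and not '=' (assignment). Assigning through a boolean index, for example ratings[movies == 'Harry Potter'] = 0, will change the values in the table.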

If you want to retrieve the rating that the 3rd rater gave to Love Actually:

In [30]:
ratings[movies == 'Love Actually', 2]

array([1])

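Tip: To select more than one name, combine conditions with the & (and) or | (or) operators, wrapping each condition in parentheses. The Python keywords and and or do not work with boolean arrays. Similarly, you can use "<" or ">" to select values smaller or greater than a specified value.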
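
The tip above can be sketched as follows. The movies and ratings arrays are re-created here with the values shown earlier so that the block runs on its own; the names mask, liked_by_first_two and high are illustrative.

```python
import numpy as np

movies = np.array(['The Lord of the Rings', 'The Godfather', 'Harry Potter',
                   'The Pianist', 'Love Actually', 'Avatar'])
ratings = np.array([[3, 3, 4, 0],
                    [0, 4, 5, 3],
                    [4, 0, 2, 0],
                    [1, 4, 0, 4],
                    [4, 0, 1, 2],
                    [3, 4, 3, 0]])

# | (or): rows for either of two movies; note the parentheses
mask = (movies == 'Harry Potter') | (movies == 'Avatar')
print(ratings[mask].tolist())  # [[4, 0, 2, 0], [3, 4, 3, 0]]

# & (and): movies rated 3 or more by both the first and the second user
liked_by_first_two = (ratings[:, 0] >= 3) & (ratings[:, 1] >= 3)
print(movies[liked_by_first_two].tolist())  # ['The Lord of the Rings', 'Avatar']

# A comparison on the whole table selects individual ratings
high = ratings[ratings >= 4]
print(high.tolist())  # [4, 4, 5, 4, 4, 4, 4, 4]
```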

## Indexing Using Integer Arrays

It is possible to choose rows from an array, simply by passing a list of integers which correspond to rows:

In [32]:
ratings[[0,2,4]]

array([[3, 3, 4, 0],
       [4, 0, 2, 0],
       [4, 0, 1, 2]])

If you want to select rows by starting from the end (e.g. if movies are sorted by year and you want the most recent ones) you can use negative indexing:

In [33]:
ratings[[-1,-2]]

array([[3, 4, 3, 0],
       [4, 0, 1, 2]])

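Finally, you can choose individual elements by passing one list of row indices and one list of column indices. The two lists are paired up, so the following selects the elements at positions (0, 2), (2, 2) and (4, 3):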

In [34]:
ratings[[0,2,4],[2,2,3]]

array([4, 2, 2])

## Array Manipulation

Transposition: NumPy offers a very efficient way of transposing a matrix. If we want to transpose the ratings matrix so that each row corresponds to a user and each column to a movie, we can simply use the **T** attribute:

In [35]:
ratings.T

array([[3, 0, 4, 1, 4, 3],
       [3, 4, 0, 4, 0, 4],
       [4, 5, 2, 0, 1, 3],
       [0, 3, 0, 4, 2, 0]])

Matrix Product: NumPy is also very efficient in calculating the matrix product. In order to multiply two matrices A and B, A needs to have the same number of columns as B's rows. Consider the following example:

In [36]:
A = np.random.randint(10, size=(5,4))
B = np.random.randint(10, size=(4,7))

In [37]:
A

array([[0, 2, 4, 5],
       [7, 2, 0, 2],
       [2, 2, 2, 1],
       [5, 3, 2, 8],
       [5, 9, 9, 2]])

In [38]:
B

array([[3, 6, 4, 3, 6, 7, 1],
       [1, 5, 0, 2, 9, 3, 2],
       [3, 8, 2, 9, 7, 9, 2],
       [9, 3, 9, 0, 7, 0, 5]])

In [39]:
dotproduct = np.dot(A,B)
dotproduct

array([[ 59,  57,  53,  40,  81,  42,  37],
       [ 41,  58,  46,  25,  74,  55,  21],
       [ 23,  41,  21,  28,  51,  38,  15],
       [ 96,  85,  96,  39, 127,  62,  55],
       [ 69, 153,  56, 114, 188, 143,  51]])
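
The shape rule can be checked on a smaller example. The sketch below uses two hand-written matrices (not the random ones above) and also shows the @ operator, which computes the same matrix product as np.dot for two-dimensional arrays.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])     # shape (3, 2)
B = np.array([[1, 0, 2],
              [0, 1, 3]])  # shape (2, 3)

product = A @ B            # shape (3, 3): rows of A by columns of B
print(product.tolist())    # [[1, 2, 8], [3, 4, 18], [5, 6, 28]]
print(np.array_equal(product, np.dot(A, B)))  # True

# Mismatched shapes raise an error: A has 2 columns but A has 3 rows
try:
    A @ A
    shapes_rejected = False
except ValueError:
    shapes_rejected = True
print(shapes_rejected)     # True
```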

## Conditions

numpy.where: this function applies conditional logic element by element. Instead of writing if... else... rules, it takes each element from an array x where the condition is True (if) and from an array y where it is False (else).
Suppose we have two arrays x and y:

In [40]:
x = np.array([1,2,3,4,5])
y = np.array([2,3,4,5,6])

And the conditions:

In [41]:
cond = np.array([True, False, True, False, False])

Then, the result can be obtained as follows: 

In [42]:
result = np.where(cond, x, y)
result

array([1, 3, 3, 5, 6])
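
For comparison, here is a sketch of the if... else... logic that np.where replaces, written as a plain Python list comprehension over the same x, y and cond. The names looped and vectorised are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])
cond = np.array([True, False, True, False, False])

# Element-by-element if/else in plain Python
looped = [xv if c else yv for xv, yv, c in zip(x, y, cond)]

# The same selection in a single vectorised call
vectorised = np.where(cond, x, y)

print(vectorised.tolist())                 # [1, 3, 3, 5, 6]
print(np.array_equal(looped, vectorised))  # True
```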

Additionally, consider that we have an array with the ages of some people:

In [43]:
ages = np.random.randint(low=18, high=65, size = (5,5))
ages 

array([[37, 59, 28, 49, 55],
       [59, 42, 33, 64, 39],
       [61, 34, 57, 54, 61],
       [35, 60, 36, 58, 48],
       [49, 41, 25, 32, 43]])

Suppose that we want to turn this matrix into a binary matrix where 0 corresponds to ages<40 and 1 corresponds to ages >= 40: 

In [44]:
np.where(ages<40, 0, 1)

array([[0, 1, 0, 1, 1],
       [1, 1, 0, 1, 0],
       [1, 0, 1, 1, 1],
       [0, 1, 0, 1, 1],
       [1, 1, 0, 0, 1]])

## Simple statistical functions
NumPy makes it easy to perform quick mathematical and statistical calculations such as the mean and median. Using the ages matrix from the previous example, we can calculate the mean, median and sum.

In [51]:
ages.mean()

46.36

In [52]:
np.median(ages)

48.0

In [53]:
ages.sum()

1159

We can also calculate the indices of the minimum and maximum values. For a multidimensional array, the index returned refers to a position in the flattened array:

In [54]:
ages.argmin()

22

In [55]:
ages.argmax()

8
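
The results 22 and 8 are positions in the flattened 5x5 array. A sketch of how to turn such a position back into a row and column, and how to compute statistics per row or per column with the axis argument, is given below. The ages array is re-created with the values shown earlier so that the block runs on its own.

```python
import numpy as np

ages = np.array([[37, 59, 28, 49, 55],
                 [59, 42, 33, 64, 39],
                 [61, 34, 57, 54, 61],
                 [35, 60, 36, 58, 48],
                 [49, 41, 25, 32, 43]])

flat_index = ages.argmin()  # position in the flattened array
row, col = np.unravel_index(flat_index, ages.shape)

print(int(flat_index))      # 22
print(int(row), int(col))   # 4 2
print(int(ages[row, col]))  # 25

# axis=0 works down the columns, axis=1 works along the rows
print(ages.sum(axis=0).tolist())  # [241, 236, 179, 257, 246]
print(ages.max(axis=1).tolist())  # [59, 64, 61, 60, 49]
```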

## I/O with NumPy

With NumPy you can save arrays to disk and load them back, in either text or binary format. The **np.save** and **np.load** functions used below work with NumPy's binary format, which is stored with the ".npy" file extension.

We will use the array "ratings", which we created earlier to demonstrate how to save and load data.

In order to store the array in the working directory, the following statement should be used: 

In [59]:
np.save('ratings', ratings)

The following statement will have the same result:

In [60]:
np.save('ratings.npy', ratings)

The extension ".npy" is added automatically in the first example. 

Similarly, in order to load the table we will only need to "load" it from the working directory. 

In [62]:
np.load('ratings.npy')

array([[3, 3, 4, 0],
       [0, 4, 5, 3],
       [4, 0, 2, 0],
       [1, 4, 0, 4],
       [4, 0, 1, 2],
       [3, 4, 3, 0]])
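
For the text format, a minimal sketch using np.savetxt and np.loadtxt is shown below. The file name ratings.csv, the temporary folder and the shortened two-row array are assumptions made for this illustration.

```python
import os
import tempfile

import numpy as np

ratings = np.array([[3, 3, 4, 0],
                    [0, 4, 5, 3]])

# Write to a temporary folder so the working directory is left untouched
text_path = os.path.join(tempfile.mkdtemp(), "ratings.csv")  # hypothetical name

# Save as comma-separated integers, then read the file back
np.savetxt(text_path, ratings, fmt="%d", delimiter=",")
loaded = np.loadtxt(text_path, delimiter=",", dtype=int)

print(np.array_equal(ratings, loaded))  # True
```

Text files can be opened in any editor or spreadsheet, while the binary ".npy" format keeps the exact data type and shape and is generally faster to read and write.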
