# Introduction to NumPy

__NumPy__ (like __SciPy__) is a Python package for doing data science efficiently. Learn to work with the __NumPy__ array, a faster and more powerful alternative to the list, and take your first steps in data exploration.


It provides:

- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities
- and much more

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.


<font color="orange">
## Why should you use ndarray objects?
</font>
...

# Lists Recap
--------------------

- Powerful
- Collection of values
- Hold different types
- Change, add, remove
<font color="red">
- Need for Data Science
    - mathematical operations over collections
    - Speed
</font>

# Illustration with an example
------------------------------------------


<font color=green>
#### Calculate the body mass index for each family member
</font>
\begin{equation*}
BMI = \frac{weight[kg]}{height[m]^{2}}
\end{equation*}
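
As a quick sanity check of the formula, here is a minimal sketch that evaluates it for a single person. The values 65.4 kg and 1.73 m are the first entries of the family data used below; the variable names are illustrative only.

```python
# BMI for one person: weight in kilograms divided by height in meters squared
weight_kg = 65.4   # weight of the first family member
height_m = 1.73    # height of the first family member

bmi_one = weight_kg / height_m ** 2
print(round(bmi_one, 2))  # 21.85
```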


In [1]:
# Illustration
# height and weight of all your family ;)

height = [1.73, 1.68, 1.71, 1.25, 1.89, 1.58]

weight = [65.4, 59.2, 63.6, 23.0, 88.4, 68.7]


In [None]:
# Calculate the body mass index for each family member
bmi = weight / height ** 2

<font color="red">
### Unfortunately, Python raises an error, because it __has no idea how to do calculations with lists!__
</font>
However, we could work around this by going through the elements one by one and calculating the BMI of each family member separately
```
bmi = []
for each pair of weight and height:
    add w/h**2.0 to the list bmi
```
but this is terribly inefficient (**Exercise 10**).


In [None]:
# do the loop to compute the BMI for each member of the family
# define the list to store all the BMI values
bmi = []

# pair up each weight with its height (list() is needed in Python 3, where zip returns an iterator)
talla = list(zip(weight, height))

print("\t * talla is a ", type(talla), " of ", type(talla[0]))
print("\t * Showing the first 2 elements of talla: ", talla[:2])

# for each member of the family compute its BMI
for w, h in talla:
    bmi.append(w / h**2)

print("\t * Showing the first two elements of bmi: ", bmi[:2])


# Solution: NumPy

A more elegant solution is to use  [__NumPy__](https://www.numpy.org/) or __Numeric Python__
- NumPy Array is an alternative to regular __Python List__
- calculations over entire arrays
- easy and fast

<font color=green>
### Using array objects, calculate the BMI for each member of the family
</font>
#### Instructions
> - Import numpy as np (as it is usually known)
> - Create a NumPy array for the variable __np_height__ by calling the numpy function array
> - Show the output of __np_height__
> - Create a NumPy array for the variable __np_weight__ by calling the numpy function array
> - Calculate everybody's BMI with a single array expression and store it under the variable __np_bmi__

In [None]:
# import numpy as np
import numpy as np

# create an array for the height
np_height = np.array(height)

# create an array for the weight
np_weight = np.array(weight)

# calculate the bmi of all members of the family in a single step
np_bmi = np_weight / np_height**2

# show the BMI
print("\t * The BMI of each member of the family is ", np_bmi)

In [None]:
# convert the bmi list (from the DIY loop) into an array
bmi_ex1 = np.array(bmi)

# ratio between the two bmi arrays
ratio_bmi = np_bmi / bmi_ex1

# print the ratio: it is 1 everywhere if both results agree
print("\n\t * Are both results compatible?", ratio_bmi)
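
Instead of inspecting the ratio by eye, one option is to let NumPy compare the two results and to time both approaches. The sketch below is self-contained: it rebuilds the family data, and the helper name `bmi_with_loop`, the tiling factor and the repeat count are arbitrary choices for illustration. Timings depend on your machine, so no particular numbers are claimed here; the advantage of arrays is expected to show up for large collections rather than for six values.

```python
import timeit
import numpy as np

height = [1.73, 1.68, 1.71, 1.25, 1.89, 1.58]
weight = [65.4, 59.2, 63.6, 23.0, 88.4, 68.7]

def bmi_with_loop(weights, heights):
    """Compute the BMI element by element with plain Python lists."""
    return [w / h**2 for w, h in zip(weights, heights)]

np_height = np.array(height)
np_weight = np.array(weight)

bmi_list = bmi_with_loop(weight, height)
np_bmi = np_weight / np_height**2

# np.allclose is True when all elements agree within floating-point tolerance
same = np.allclose(np_bmi, bmi_list)
print("Both approaches agree:", same)

# Repeat with a larger, synthetic data set (the family data repeated 10000 times)
big_h = height * 10000
big_w = weight * 10000
np_big_h = np.array(big_h)
np_big_w = np.array(big_w)

t_loop = timeit.timeit(lambda: bmi_with_loop(big_w, big_h), number=10)
t_numpy = timeit.timeit(lambda: np_big_w / np_big_h**2, number=10)
print("loop: %.4f s, numpy: %.4f s" % (t_loop, t_numpy))
```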


# Type Of NumPy Arrays
------------------------------------------


If you know what a vector or a matrix is, NumPy ndarrays generalize both:
- a __1D array__ is just a vector (__height_in, weight_lb__ in the example)
- a __2D array__ is just a matrix, a set of numbers arranged in rows and columns so as to form a rectangular array (__talla_2d__ in the example)
- an __ndarray__ extends the same idea to any number of dimensions

In [None]:
# load the data by using the pandas package (NEXT CLASS)
import pandas as pd
data = pd.read_csv("data/numpy_intro_data.csv", index_col="row")

# read height and weight as 1D array
height_in = data["height (inches)"].values
weight_lb = data["weight (pounds)"].values

# show table (first 10 rows)
data.head(10)

If you ask for the type of ```height_in``` or ```weight_lb```, Python tells you that it is an **ndarray** object, which stands for

            ndarray = N-dimensional array

It is possible to create a 1D array, 2D array, 3D array, ... 7D array
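
A small sketch of what "N-dimensional" means in practice. The arrays are made-up toy data; `ndim` and `shape` are standard ndarray attributes.

```python
import numpy as np

a1 = np.array([1, 2, 3])                 # 1D: a vector
a2 = np.array([[1, 2, 3], [4, 5, 6]])    # 2D: a matrix with 2 rows and 3 columns
a3 = np.zeros((2, 3, 4))                 # 3D: 2 blocks, each a 3x4 matrix

for arr in (a1, a2, a3):
    print(arr.ndim, arr.shape)
# 1 (3,)
# 2 (2, 3)
# 3 (2, 3, 4)
```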


## 2D NumPy Arrays
We can create a 2D array from two 1D arrays like this

```python
In [1]: talla_2d = np.column_stack( (height_in,weight_lb) )

```

If you print out ```talla_2d``` you can see that it is a rectangular data structure.

Each row of the two-dimensional array holds the height and weight of one player, and each column corresponds to one of the original 1D arrays, i.e. ```height_in``` and ```weight_lb```.

From ```shape``` we can see that we end up with 1015 rows and 2 columns

```python
In [1]: talla_2d.shape
Out[1]: (1015, 2)
```


```shape``` is an attribute of the ```ndarray``` and can give you more information about what the data looks like.
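
To see how `np.column_stack` arranges its inputs, the following sketch uses three made-up players (the numbers are illustrative, not taken from the data file) and compares the result with passing the same arrays to `np.array`.

```python
import numpy as np

h = np.array([74, 72, 69])      # hypothetical heights in inches
w = np.array([180, 215, 210])   # hypothetical weights in pounds

as_columns = np.column_stack((h, w))  # each input becomes a column
as_rows = np.array([h, w])            # each input becomes a row

print(as_columns.shape)  # (3, 2)
print(as_rows.shape)     # (2, 3)

# transposing swaps rows and columns, so both layouts hold the same data
print(np.array_equal(as_columns, as_rows.T))  # True
```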


In [None]:
# create a 2D array
talla_2d = np.column_stack( (height_in,weight_lb) )

talla_2d

In [None]:
talla_2d.shape


### Note that an array can contain only one single type

If you change a single element of the array to a __string__, the whole array is converted to __string__ so that it stays homogeneous (under Python 2 the dtype is reported as `'|S32'` instead)
```python
In [2]: np.array([65.2,59.2,63.6,"88.4"])
Out[2]: array(['65.2', '59.2', '63.6', '88.4'], dtype='<U32')
```


In [None]:
np.array([65.2,59.2,63.6,"88.4"])

In [None]:
[65.2,59.2,63.6,"88.4"]
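
One way to check which type NumPy settled on is the `dtype` attribute. The sketch below reuses the values from the cell above and also shows `astype`, which can convert the strings back to numbers as long as every entry is a valid number.

```python
import numpy as np

floats = np.array([65.2, 59.2, 63.6, 88.4])
mixed = np.array([65.2, 59.2, 63.6, "88.4"])

print(floats.dtype)       # float64
print(mixed.dtype.kind)   # 'U' (unicode string) under Python 3

# convert the string array back to floating-point numbers
restored = mixed.astype(float)
print(restored.dtype)     # float64
print(np.allclose(floats, restored))
```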

# NumPy Remarks
- NumPy arrays contain only one type (example A)
- NumPy arrays come with their own operations (example B)

<font color=red>
#### DIFFERENT TYPES, DIFFERENT BEHAVIOR!!
</font>
### Let's see it with examples

In [None]:
#####    EXAMPLE A  ####
# list of three elements all of them of different types
l_of_difftypes = [1.0, "is", True]

# convert this list into an array, what happens?
np_of_difftypes = np.array(l_of_difftypes)

print("\n\t * All elements are converted to strings ", np_of_difftypes)

In [None]:
#####    EXAMPLE B   ####
l1 = [1,2,3]
l2 = [4,5,6]

# sum l1 and l2
l_sum = l1+l2

# convert l1 and l2 into arrays, as a1 and a2
a1 = np.array(l1)
a2 = np.array(l2)

# sum a1 and a2
a_sum = a1+a2

# do you get the same results for l1+l2 and a1+a2?
print("\n\t * the sum of two lists is another list where the two lists are concatenated ", l_sum)
print("\t * however, the sum of two arrays is the sum of the individual elements ", a_sum)
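
The same difference shows up with other operators. A minimal sketch with toy values:

```python
import numpy as np

l1 = [1, 2, 3]
a1 = np.array(l1)

print(l1 * 2)   # [1, 2, 3, 1, 2, 3]  -> the list is repeated
print(a1 * 2)   # [2 4 6]             -> each element is doubled
print(a1 ** 2)  # [1 4 9]             -> each element is squared

# ** is not defined for lists, so this raises a TypeError
try:
    l1 ** 2
except TypeError as err:
    print("lists do not support **:", err)
```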


# NumPy Subsetting
------------------------------------------


You've seen it with your own eyes: Python ```lists``` and numpy ```arrays``` sometimes behave differently. Luckily, there are still certainties in this world. For instance, subsetting (using the square bracket notation on lists or arrays) works exactly the same.

There are several ways to access the elements of an array:
- as you do in lists
- by using a boolean array (to access specific elements, for instance those with bmi>23)



In [None]:
data.head(7)

#### Selecting different elements from the table:

1. Selecting with brackets
Suppose you want the first row and then the 2nd element in that row. To select the row you need to insert the index 0 between brackets
```python
In [10]: talla_2d[0]
Out[10]: array([ 74, 180])
```
to select the 2nd element, you just extend the same call with an extra pair of brackets
```python
In [11]: talla_2d[0][1]
Out[11]: 180
```
In general, you select one row, and from that row you select the column

2. Selecting with commas, this call is as follows
```python
In [12]: talla_2d[0,1]
Out[12]: 180
```
the value before the comma specifies the row, and the one after it the column

3. Selecting specific regions
Suppose you want to select the height and weight of the 2nd and 3rd members of the family. To select both rows you just put __:__ before the comma, and to select the 2nd and 3rd columns, use __1:3__ (the end index 3 is excluded)
```python
In [14]: np_2d = np.array([[1.73, 1.68, 1.71, 1.25, 1.89, 1.58], [65.4, 59.2, 63.6, 23.0, 88.4, 68.7]])
In [15]: np_2d[:,1:3]
Out[15]: array([[  1.68,   1.71],
                [ 59.2 ,  63.6 ]])
```
The intersection gives us an array with two rows and two columns. __DIY: Select all family weights__

4. Selecting with boolean arrays (to select specific values within the *table*)

```python
In [16]: np_bmi
Out[16]: array([ 21.85171573,  20.97505669,  21.75028214,  14.72      ,
         24.7473475 ,  27.51962826])
         
In [17]: np_bmi_mask = np_bmi>23

In [18]: np_bmi[np_bmi_mask]
Out[18]: array([ 24.7473475 ,  27.51962826])

```
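
Masks can also be combined. As a sketch, the following selects the BMI values inside a range using `&` (element-wise "and"). The thresholds 21 and 25 are arbitrary, and the parentheses are needed because `&` binds more tightly than the comparisons.

```python
import numpy as np

np_height = np.array([1.73, 1.68, 1.71, 1.25, 1.89, 1.58])
np_weight = np.array([65.4, 59.2, 63.6, 23.0, 88.4, 68.7])
np_bmi = np_weight / np_height**2

# True where the BMI lies strictly between 21 and 25
in_range = (np_bmi > 21) & (np_bmi < 25)

print(in_range)          # [ True False  True False  True False]
print(np_bmi[in_range])  # the selected BMI values
print(in_range.sum())    # 3, because each True counts as 1
```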


In [None]:
# show the second element of the array np_bmi (index 1, since indexing starts at 0)
print("\t * the second element of the array np_bmi is", np_bmi[1])

# show the last element of an array
print("\t * the last element of the array np_bmi is", np_bmi[-1])

# show only those np_bmi > 23
np_bmi_mask = np_bmi > 23

print("\t * Showing only those bmi higher than 23: ", np_bmi[np_bmi_mask])


# Let's practice
----------------------



<font color=green>
# Exercise 1. Your First NumPy Array
</font>
In this exercise, we're going to dive into the world of baseball. Along the way, you'll get comfortable with the basics of numpy, a powerful package to do data science.

A list baseball has already been defined in the Python script, representing the height of some baseball players in centimeters. Can you add some code here and there to create a numpy array from it?

#### Instructions
> - Import the numpy package as np, so that you can refer to numpy with np.
> - Use np.array() to create a numpy array from baseball. Name this array np_baseball.
> - Print out the type of np_baseball to check that you got it right.

In [None]:
# Create list baseball
baseball = [180, 215, 210, 210, 188, 176, 209, 200]

# Import the numpy package as np
import numpy as .....

# Create a numpy array from baseball: np_baseball
np_baseball = np.array(......)

# Print out type of np_baseball
print("\t * the type of np_baseball is ", type(np_baseball))

<font color=green>
# Exercise 2. Baseball players' height
</font>
You are a huge baseball fan. You decide to call the MLB (Major League Baseball) and ask around for some more statistics on the height of the main players. They pass along data on more than a thousand players, which is stored as a regular Python list: __height_in__. The height is expressed in inches. __Can you make a numpy array out of it and convert the units to meters?__

In [None]:
# load the data from the csv file by using the loadtxt numpy function
# height and weight are then converted to regular lists
index,np_age_year, height_in, weight_lb = np.loadtxt("data/numpy_intro_data.csv", delimiter=",", 
                                                  unpack=True, skiprows=1)

# convert into list objects
height_in = list(height_in)
weight_lb = list(weight_lb)



#### Instructions
> - Create a numpy array from height_in. Name this new array np_height_in.
> - Print np_height_in.
> - Multiply np_height_in with 0.0254 to convert all height measurements from inches to meters. Store the new values in a new array, np_height_m.
> - Print out np_height_m and check if the output makes sense.


In [None]:

# Create a numpy array from height_in: np_height_in
np_height_in = np.array(........)

# Print out np_height_in
print("\t * height in inches ", np_height_in)

# Convert np_height_in to m: np_height_m
np_height_m = ....... * 0.0254

# Print np_height_m
print("\t * height in meters ", np_height_m)

<font color=green>
# Exercise 3.  Baseball players' BMI
</font>
The MLB also offers to let you analyze their weight data. Again, both are available as regular Python lists: height_in and weight_lb. height_in is in inches and weight_lb is in pounds.

It's now possible to calculate the BMI of each baseball player. Python code to convert height_in to a numpy array with the correct units is already available in the workspace (previous exercise). Follow the instructions step by step and finish the game!

#### Instructions
> - Create a numpy array from the __weight_lb__ list with the correct units: multiply by 0.453592 to go from pounds to kilograms. Store the resulting numpy array as __np_weight_kg__.
> - Use __np_height_m__ and __np_weight_kg__ to calculate the BMI of each player.
> - Save the resulting numpy array as __bmi__.
> - Print out __bmi__.

In [None]:
# height and weight are available as regular lists

# Create array from weight_lb with metric units: np_weight_kg
np_weight_kg = ....(....) * ....

# Calculate the BMI: bmi
bmi = ..... / ....**2

# Print out bmi
print("\n\t * The BMI of each baseball player is ", bmi)

<font color=green>
# Exercise 4. Lightweight baseball players
</font>
To subset both regular Python lists and numpy arrays, you can use square brackets:

```python
x = [4 , 9 , 6, 3, 1]
x[1]
import numpy as np
y = np.array(x)
y[1]
```

For ```numpy``` specifically, you can also use boolean ```numpy``` arrays:
```python
high_mask = y > 5
y[high_mask]
```
The code that calculates the BMI of all baseball players is already included (previous exercise). Follow the instructions and reveal interesting things from the data!

#### Instructions
> - Create a boolean ```numpy array```: the element of the array should be True if the corresponding baseball player's BMI is below 21. You can use the < operator for this. Name the array ```light```.
> - Print the array ```light```.
> - Print out a ```numpy array``` with the BMIs of all baseball players whose BMI is below 21. Use ```light``` inside square brackets to do a selection on the ```bmi``` array.


In [None]:
# Create the light array
light = .... < 21

# Print out light
print("\n\t * Mask for those players with bmi < 21: ", light)

# Print out BMIs of all baseball players whose BMI is below 21
print("\t * BMI of the light players: ", ....[....])

print("\n\t * There are {0} players with BMI below 21 ".format(len(....[....])))

<font color=green>
# Exercise 5. Lists vs Arrays
</font>
As we said, ```numpy``` is great for doing vector arithmetic. If you compare its functionality with regular Python ```lists```, however, some things have changed.

- First of all, <font color=red>numpy arrays cannot contain elements with different types</font>. If you try to build such an array, some of the elements' types are changed to end up with a homogeneous array. __This is known as type coercion__.

- Second, the typical arithmetic operators, such as ```+```, ```-```, ```*``` and ```/``` have a different meaning for regular Python lists and numpy arrays.

Have a look at this line of code:
```python
np.array([True, 1, 2]) + np.array([3, 4, False])
```
#### Can you tell which code chunk builds the exact same Python object? The ```numpy``` package is already imported as np, so you can start experimenting in the IPython Shell straight away!

#### Possible Answers
> - ```np.array([True, 1, 2, 3, 4, False])```
> - ```np.array([4, 3, 0]) + np.array([0, 2, 2])```
> - ```np.array([1, 1, 2]) + np.array([3, 4, -1])```
> - ```np.array([0, 1, 2, 3, 4, 5])```

In [None]:
### play to find the correct answer

<font color=green>
# Exercise 6. Subsetting NumPy Arrays
</font>
You've seen it with your own eyes: Python ```lists``` and numpy ```arrays``` sometimes behave differently. Luckily, there are still certainties in this world. For example, subsetting (using the square bracket notation on lists or arrays) works exactly the same. To see this for yourself, try the following lines of code in the IPython Shell:
```python
x = ["a", "b", "c"]
x[1]

np_x = np.array(x)
np_x[1]
```
#### Instructions
> - Subset ```np_weight_lb``` by printing out the element at index 50.
> - Print out a sub-array of ```np_height_in``` that contains the elements at index 100 up to and including index 108.

In [None]:
# height and weight are available as regular lists

# Import numpy
# import numpy as np

# Store weight and height lists as numpy arrays
np_weight_lb = np.array(.....)
np_height_in = np.array(.....)

# Print out the weight at index 50
print("\n\t * The weight of the player at index 50 is ", .....[....])

# Print out sub-array of np_height_in: index 100 up to and including index 108
print("\t * The heights at index 100 up to and including 108 are ", .....[.....])

<font color=green>
# Exercise 7. Your First 2D NumPy Array
</font>
Before working on the actual MLB data, let's try to create a 2D ```numpy``` array from a small list of lists.

In this exercise, ```baseball``` is a list of lists. The main list contains 4 elements. Each of these elements is a list containing the height and the weight of 4 baseball players, in this order. ```baseball``` is already coded for you in the script.

#### Instructions
> - Use ```np.array()``` to create a 2D numpy array from ```baseball```. Name it ```np_baseball```.
> - Print out the type of ```np_baseball```.
> - Print out the shape attribute of ```np_baseball```. Use ```np_baseball.shape```.


In [None]:
# Import numpy
# import numpy as np
# Create baseball, a list of lists
baseball = [[180, 78.4],
            [215, 102.7],
            [210, 98.5],
            [188, 75.2]]

# Create a 2D numpy array from baseball: np_baseball
np_baseball = np.array(.....)

# Print out the type of np_baseball
print("\t * type of np_baseball ", type(....))

# Print out the shape of np_baseball
print("\t * shape of np_baseball ", .....)

<font color=green>
# Exercise 8. Baseball data in 2D form
</font>
You have another look at the MLB data and realize that it makes more sense to restructure all this information in a 2D numpy array. This array should have 1015 rows, corresponding to the 1015 baseball players you have information on, and 2 columns (for ```np_height_m``` and ```np_weight_kg```).

        rows :: observations/ samples
        columns :: attributes/ features of each observation (sample)

Can you store the data as a 2D array to unlock numpy's extra functionality?

#### Instructions
> - Use zip() to create a list of tuples, where the first element of each pair is the player's height in m and the second is the player's weight in kg. 
> - Use np.array() to create a 2D numpy array from this list. Name it np_baseball.
> - Print out the shape attribute of np_baseball.


In [None]:
# np_height_m and np_weight_kg are available from Exercises 2 and 3
# list() is needed in Python 3, where zip returns an iterator
talla_l = list(zip(np_height_m, ....))

# Create a 2D numpy array from talla_l: np_baseball
np_baseball = np.array( .... )

# another way will be using np.column_stack
# np_baseball = np.column_stack( (np_height_m, np_weight_kg) )

# Print out the shape of np_baseball
print("\t * The shape of np_baseball is ", np_baseball.shape)

<font color=green>
# Exercise 9. Subsetting 2D NumPy Arrays
</font>
If your 2D numpy array has a regular structure, i.e. each row and column has a fixed number of values, complicated ways of subsetting become very easy. Have a look at the code below where the elements "a" and "c" are extracted from a list of lists.

```python

# regular list of lists
x = [["a", "b"], ["c", "d"]]
[x[0][0], x[1][0]]

# numpy
import numpy as np
np_x = np.array(x)
np_x[:,0]

```
For regular Python lists, this is a real pain. For 2D ```numpy``` arrays, however, it's pretty intuitive! The indexes before the comma refer to the rows, while those after the comma refer to the columns. The ```:``` is for slicing; in this example, it tells Python to include all rows.

        np_array[ row_index, column_index]
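
A few more selections on a small made-up array, as a sketch of the `[row_index, column_index]` pattern:

```python
import numpy as np

np_x = np.array([[10, 11, 12],
                 [20, 21, 22],
                 [30, 31, 32]])

print(np_x[1, :])    # second row:           [20 21 22]
print(np_x[:, 2])    # third column:         [12 22 32]
print(np_x[0:2, 1])  # rows 0-1 of column 1: [11 21]
print(np_x[2, 0])    # single element:       30
```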

The code that converts the pre-loaded baseball list to a 2D numpy array is already done. ```np_baseball```: the first column contains the players' height in inches and the second column holds player weight, in pounds. Add some lines to make the correct selections. Remember that in Python, the first element is at index 0!

#### Instructions
> - Print out the 50th row of np_baseball.
> - Make a new variable, np_weight_lb, containing the entire second column of np_baseball.
> - Select the height (first column) of the 124th baseball player in np_baseball and print it out.

In [None]:
# Print out the 50th row of np_baseball
print(....[49,:])

# Select the entire second column of np_baseball: np_weight_lb
print(....)

# Print out height of 124th player
print(....)

<font color=green>
# Exercise 10. Compute the BMI using lists
</font>
#### Instructions
> - define a list to store all the BMI values, __bmi__
> - define a list of tuples, __talla__, where each tuple contains the weight and height of one member of the family
> - using a __for__ loop, go over all elements of __talla__, compute each BMI value, and add it to the list __bmi__

In [None]:
# assuming the lists
height = [1.73, 1.68, 1.71, 1.25, 1.89, 1.58]

weight = [65.4, 59.2, 63.6, 23.0, 88.4, 68.7]

# do the loop to compute the BMI for each member of the family
# define the list to store all the BMI values
bmi = []

# pair up each weight with its height (list() is needed in Python 3, where zip returns an iterator)
talla = list(zip(weight, height))

print("\t * talla is a ", type(talla), " of ", type(talla[0]))
print("\t * Showing the first 2 elements of talla: ", talla[:2])

# for each member of the family compute its BMI
for ... , .... in ....:
    bmi.append( w/h**2)

print("\t * Showing the first two elements of bmi: " , ....)

<font color=green>
# Exercise 11. 2D Arithmetic
</font>
Remember how you calculated the Body Mass Index for all baseball players? ```numpy``` was able to perform all calculations element-wise (i.e. element by element). For 2D ```numpy``` arrays this isn't any different! You can combine matrices with single numbers, with vectors, and with other matrices.

Execute the code below in the IPython shell and see if you understand:

```python
import numpy as np
np_mat = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])
np_mat * 2
np_mat + np.array([10, 10])
np_mat + np_mat
```
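
Combining a matrix with a vector works through broadcasting: NumPy lines up the shapes starting from the last axis. As a sketch with the same `np_mat` (the vectors `per_column` and `per_row` are made-up names and values), a vector of length 2 matches the 2 columns, while a vector of length 3 has to be reshaped into a column before it can be added row by row.

```python
import numpy as np

np_mat = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])          # shape (3, 2)

per_column = np.array([10, 20])      # shape (2,): one value per column
print(np_mat + per_column)
# [[11 22]
#  [13 24]
#  [15 26]]

per_row = np.array([100, 200, 300])  # shape (3,): one value per row
try:
    np_mat + per_row                 # shapes (3, 2) and (3,) do not line up
except ValueError as err:
    print("cannot broadcast:", err)

# reshape to (3, 1) so that each value is applied to a whole row
print(np_mat + per_row[:, np.newaxis])
# [[101 102]
#  [203 204]
#  [305 306]]
```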

```np_baseball``` is coded for you: it's again a 2D ```numpy``` array with 3 columns representing height (in inches), weight (in pounds) and age (in years). 

In [None]:
np_baseball = np.column_stack((np_height_in, np_weight_lb, np_age_year) )
np_baseball.shape

In [None]:
# creating the 2D array of changes (one row per player, one column per attribute)
Nrows = 1015

np_uncertanty = np.column_stack( (np.random.normal(1.008,0.20,Nrows), np.random.normal(-0.21,10.089,Nrows), 
                         np.zeros_like(np_baseball.T[2])) )
np_uncertanty.shape

#### Instructions
> - You managed to get hold of the changes in height, weight and age of all baseball players. It is available as a 2D numpy array, **np_uncertanty**. Add **np_baseball** and **np_uncertanty** and print out the first 5 elements of result.
> - You want to convert the units of height and weight to metric (meters and kilograms respectively). As a first step, create a numpy array with three values: 0.0254, 0.453592 and 1. Name this array conversion.
> - Multiply np_baseball with conversion and print out the first 5 elements of the result.

In [None]:
# Print out addition of np_baseball and np_uncertanty
result = .... + ....
print(....[:5])

# Create numpy array: conversion
conversion = np.array([0.0254, 0.453592,1])


In [None]:

# Print out product of np_baseball and conversion
np_baseball_su = .... * conversion 

# print out the first 5 elements of the result
print(....[:5])