
# Numpy Arrays and DataTypes

### Introduction

Numpy is a scientific computing and data analysis library in Python.  One of the main benefits of working in numpy is its speed.  With numpy, we can perform operations on list-like collections, called numpy arrays, quickly and in a space-efficient way.  Just as important, libraries like pandas are built on numpy, and thus borrow much of its functionality.

In this lesson, we'll begin working with the fundamental data structure in numpy, the numpy array.

### Introduction to Arrays

Let's get started by importing the numpy library and constructing an array.

In [0]:
import numpy as np
python_ages = [12, 22, 20, 15]
ages = np.array(python_ages)
ages

array([12, 22, 20, 15])

We have just initialized a numpy array, and it has much of the same functionality as a list in Python.

> For example, we can slice or select items of the array.

In [0]:
ages[:2]

array([12, 22])

In [0]:
ages[0]

12

> Notice that we say that a numpy array has *items* as opposed to elements.  You may also hear an item referred to as an *entry* in an array.
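To make the comparison with lists a little more concrete, here is a small sketch of a few other list-style operations that also work on the `ages` array from above. The array is rebuilt so the cell runs on its own.

```python
import numpy as np

ages = np.array([12, 22, 20, 15])

# len() counts the items, just as it does for a list.
print(len(ages))    # 4

# Negative indices count from the end.
print(ages[-1])     # 15

# Slices take a start (included) and a stop (excluded).
print(ages[1:3])    # [22 20]
```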

### Creating Arrays

So we just saw one way that we can create an array: initialize a Python list, and then call `np.array`.  Note that we can also convert a Python range into a numpy array like so.

In [0]:
np.array(range(0, 5))

array([0, 1, 2, 3, 4])

But we can simplify our code by using the numpy equivalent: `arange`.

In [0]:
np.arange(0, 5)

array([0, 1, 2, 3, 4])
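Like Python's `range`, `arange` also accepts a step, and a single argument is treated as the stop value. The following is a brief sketch; the variable names `evens` and `first_five` are only for illustration.

```python
import numpy as np

# arange(start, stop, step): the stop value is excluded, as with range.
evens = np.arange(0, 10, 2)
print(evens)        # [0 2 4 6 8]

# With one argument, counting starts at 0.
first_five = np.arange(5)
print(first_five)   # [0 1 2 3 4]
```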

### Working with DataTypes

One difference from lists in Python is that in numpy, every item of the array must be of the same type.

In [0]:
ages.dtype

dtype('int64')

If we try to modify our array so that there is a mismatch of types, we see an error.

In [0]:
ages[0] = 'hello'

ValueError: invalid literal for int() with base 10: 'hello'

> Enforcing that items are of the same type is a key part of what makes numpy fast.  Because every item has the same type, numpy can store each one in the same fixed number of bytes, which fixes the size of the entire array.  This saves space in memory.
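We can inspect this fixed-size storage directly. The sketch below sets the dtype explicitly to `int64` so the numbers are the same on every platform. It also shows what happens when a list mixes integers and floats: numpy picks one type that can hold every value. The name `mixed` is just for illustration.

```python
import numpy as np

ages = np.array([12, 22, 20, 15], dtype=np.int64)

# Every item occupies the same number of bytes...
print(ages.itemsize)            # 8

# ...so the total is the item count times the item size.
print(ages.size, ages.nbytes)   # 4 32

# Mixing integers and floats does not produce a mixed array:
# numpy chooses a single type that fits all the values.
mixed = np.array([12, 22.5])
print(mixed.dtype)              # float64
```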

We can convert the items in our array from one datatype to another using the `astype` method. 

In [0]:
ages

array([12, 22, 20, 15])

In [0]:
ages.astype(np.float64)

array([12., 22., 20., 15.])

> One thing to note is that `astype` returns a converted *copy* of our array.  The original `ages` array is unchanged with type `int64`.

In [0]:
ages.dtype

dtype('int64')
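To see that the copy really is independent, we can change it and check the original. This sketch also shows that converting floats to integers drops the fractional part; `float_ages` and `heights` are illustrative names.

```python
import numpy as np

ages = np.array([12, 22, 20, 15], dtype=np.int64)
float_ages = ages.astype(np.float64)

# Changing the converted copy leaves the original untouched.
float_ages[0] = 99.5
print(ages[0])       # 12
print(ages.dtype)    # int64

# Converting floats to integers drops the fractional part.
heights = np.array([1.9, 2.5, 3.1])
int_heights = heights.astype(np.int64)
print(int_heights)   # [1 2 3]
```

To keep a converted array, we assign the result of `astype` to a name, as with `float_ages` here.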

We can see a list of array types [here](https://docs.scipy.org/doc/numpy-1.10.1/user/basics.types.html).  It's not important to know them all.  Many of them (integer, float, boolean, string) are similar to the types in Python.

To check whether a certain type exists, we can reference it on the numpy library.

In [0]:
import numpy as np
np.float64

np.float16

numpy.float16

> Notice that Numpy lets us keep each item as space-efficient as possible.  The numbers 16, 32, and 64 indicate how many bits each item occupies.  The `itemsize` of an array is reported in bytes, so it is one eighth of that number: an `int64` item takes 8 bytes.

In [0]:
np.array([214748364721474836]).astype(np.int64).itemsize

8

> Change the above to `int16` to see how the itemsize changes.
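For a side-by-side view, here is a small sketch that collects the item size of each integer type. The `sizes` dictionary is only a convenience for this example.

```python
import numpy as np

# Map each dtype name to the number of bytes a single item occupies.
sizes = {}
for dtype in (np.int16, np.int32, np.int64):
    arr = np.array([1, 2, 3], dtype=dtype)
    sizes[arr.dtype.name] = arr.itemsize

print(sizes)   # {'int16': 2, 'int32': 4, 'int64': 8}
```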

Let's turn our array of ages into an array of strings.  Why? Just for fun.

In [0]:
python_ages = [12, 22, 20, 15]
ages = np.array(python_ages)


str_ages = ages.astype(np.str_)

str_ages

array(['12', '22', '20', '15'], dtype='<U21')

See? Fun :)

### Broadcasting

Beyond having all items of the same type, the other major difference between Python lists and Numpy arrays is Numpy's ability to broadcast.  Let's see what this means.

Let's say that we wanted to increase each person's age by 2.  In Python, we accomplish this with a loop, here written as a list comprehension.

In [0]:
ages = [20, 12, 15]

In [0]:
[age + 2 for age in ages]

[22, 14, 17]

In Numpy, we can accomplish this with the following:

In [0]:
2 + np.array(ages)

array([22, 14, 17])

This is called broadcasting.

> The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. 

> [Numpy documentation](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)

In other words, we can think of numpy performing the following in the cell above.

In [0]:
ages = [20, 12, 15]

# 2 + np.array(ages)


np.array([2, 2, 2]) + np.array(ages)


array([22, 14, 17])

> The number 2 is broadcast to have the shape of the larger array `ages`.
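Numpy can show us this stretching with `np.broadcast_to`. The sketch below also checks that two arrays of different lengths (4 and 3) are not compatible, so the shapes have to line up for broadcasting to apply. The names `stretched` and `mismatch_raised` are illustrative.

```python
import numpy as np

ages = np.array([20, 12, 15])

# broadcast_to shows what the single value 2 is stretched into.
stretched = np.broadcast_to(2, ages.shape)
print(stretched)          # [2 2 2]
print(stretched + ages)   # [22 14 17]

# Arrays of length 4 and length 3 cannot be broadcast together.
mismatch_raised = False
try:
    np.array([2, 2, 2, 2]) + ages
except ValueError as err:
    mismatch_raised = True
    print("cannot broadcast:", err)
```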

One thing to note about broadcasting is that it can speed up our operations.

For example, let's create an array of length 1000.

In [0]:
ages = list(range(1000))

np_ages = np.array(ages)

In [0]:
%%time
ages_plus = [age + 2 for age in ages]
    

CPU times: user 93 µs, sys: 0 ns, total: 93 µs
Wall time: 98 µs


In [0]:
%%time
ages_plus_arr = 2 + np_ages

CPU times: user 39 µs, sys: 12 µs, total: 51 µs
Wall time: 46 µs


> So we can see that, in this run, the native Python operation takes roughly twice as long as the Numpy operation.
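A single `%%time` run can be noisy, so as a rough cross-check we can repeat each operation many times with Python's standard `timeit` module. This is only a sketch: the repeat count of 1000 is an arbitrary choice, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

ages = list(range(1000))
np_ages = np.array(ages)

# Run each version 1000 times and report the total elapsed seconds.
loop_time = timeit.timeit(lambda: [age + 2 for age in ages], number=1000)
numpy_time = timeit.timeit(lambda: 2 + np_ages, number=1000)

print(f"list comprehension: {loop_time:.4f}s")
print(f"numpy broadcasting: {numpy_time:.4f}s")
```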

We won't take a deep dive into broadcasting right now, but we can think of broadcasting as an alternative to looping.  We just saw how it applies the same operation to every item in an array, and it works for other operators, such as multiplication, too.

In [0]:
ages = [20, 12, 15]  # reset ages from the 1000-item list used for timing
2 * np.array(ages)

array([40, 24, 30])

### Summary

In this lesson, we learned some of the fundamental differences between working with arrays in Numpy and lists in Python.  The first is that all items in an array are of the same type.  If we do not specify a type, Numpy will choose an appropriate one for us.  We can convert an array to another type with its `astype` method, which returns a new array.  And we can refer to each type with `np.` followed by the type name, for example, `np.float64`.

Then we saw the second major difference in Numpy, which is broadcasting.  Broadcasting occurs when we have a smaller array (or single item) that numpy stretches to match the shape of a larger array.  We can often use this to replace loops in Python, saving us both code and computation time.

### Resources 

[Broadcasting in Numpy](https://towardsdatascience.com/why-you-should-forget-for-loop-for-data-science-code-and-embrace-vectorization-696632622d5f)