## Getting started with Jupyter Notebooks

Before we begin, let's cover a few basics about Jupyter (formerly known as IPython) notebooks. This is a Markdown cell. Select a cell (a blue boundary should appear around it, indicating command mode) and press Enter to switch to edit mode (the cell boundary turns green). To switch back to command mode, press Esc. To run the cell and move to the next one, press Shift+Enter. To run the cell without advancing, press Ctrl+Enter.

To create a new cell below the current one, switch to command mode and then press the letter "b" on your keyboard. "a" will create a new cell above the current cell. "dd" will delete the current cell.

## NumPy Arrays (ndarrays)

In [1]:
import numpy as np

In [6]:
arr = np.array([1,3,5,7])
arr

array([1, 3, 5, 7])

Why ndarrays?
- they support efficient, vectorized, elementwise operations on homogeneous data (sometimes orders of magnitude faster than "pure Python")
- they provide the foundation for operations in **pandas**
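
To get a rough feel for the speed claim, the sketch below times the same elementwise doubling in pure Python and in NumPy. The array size and repeat count are arbitrary choices for illustration, and the actual timings will vary by machine.

```python
import timeit
import numpy as np

n = 100_000  # arbitrary size, chosen only for illustration
py_list = list(range(n))
np_arr = np.arange(n)

# Same computation two ways: a list comprehension vs. a vectorized expression
py_result = [x * 2 for x in py_list]
np_result = np_arr * 2

# Time 20 repetitions of each approach
py_time = timeit.timeit(lambda: [x * 2 for x in py_list], number=20)
np_time = timeit.timeit(lambda: np_arr * 2, number=20)
print(f"pure Python: {py_time:.4f} s, NumPy: {np_time:.4f} s")
```

Both approaches produce the same values; only the time taken differs.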

### Creating an ndarray

In [7]:
# We already saw this

np.array([1,3,5,7])

array([1, 3, 5, 7])

In [11]:
# Similar to range() function used in for loops and other places

np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
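
Like `range()`, `np.arange` also accepts start, stop, and step arguments. The short sketch below shows that form alongside `np.linspace`, which is not used elsewhere in this notebook but is a common alternative when you want a fixed number of evenly spaced points.

```python
import numpy as np

# start, stop (exclusive), step: same convention as range()
evens = np.arange(2, 11, 2)

# linspace takes the number of points instead of a step, and includes the endpoint
grid = np.linspace(0.0, 1.0, 5)

print(evens)  # [ 2  4  6  8 10]
print(grid)   # [0.   0.25 0.5  0.75 1.  ]
```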

In [13]:
# Random numbers sampled from Gaussian distribution, mean = 0 and variance = 1
# Using different mean/variance: https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.randn.html

np.random.randn(5)

array([ 0.08670903,  0.45333811, -1.18501615, -0.06855212, -0.15768225])
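
As a sketch of how a different mean and standard deviation can be obtained, the standard normal samples can be scaled and shifted. The values of `mu` and `sigma` below are hypothetical, and the seed is fixed only so the example is repeatable.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatability

mu, sigma = 10.0, 2.5  # hypothetical target mean and standard deviation

# Scale by sigma (standard deviation, not variance), then shift by mu
samples = mu + sigma * np.random.randn(100_000)

print(samples.mean(), samples.std())  # should be close to mu and sigma
```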

In [14]:
# 2-D version

np.random.randn(5,2)

array([[ 0.73231391,  0.24281052],
       [ 1.1215952 , -1.31492946],
       [-0.03164894,  1.49597231],
       [ 0.95475041,  1.24291866],
       [-0.82912349,  1.26414818]])

In [20]:
# All ones

np.ones(10)

array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

In [21]:
# All zeros

np.zeros(10)

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [29]:
# From a text file

my_arr = np.loadtxt('loadarray.txt')
my_arr

array([[  1.45,   3.46],
       [  2.36,   6.43],
       [  4.65,  10.24],
       [  5.43,   3.25]])

### Investigating an ndarray

In [30]:
# Number of dimensions

my_arr.ndim

2

In [31]:
# Shape

my_arr.shape

(4, 2)

In [32]:
# Data type

my_arr.dtype

dtype('float64')

### Casting from one dtype to another

In [36]:
# Let's get a copy of my_arr as an integer array

my_intarr = my_arr.astype(int)
my_intarr
#rounded_arr = np.rint(my_arr)    # numbers rounded instead of truncated
#rounded_arr

array([[ 1,  3],
       [ 2,  6],
       [ 4, 10],
       [ 5,  3]])

Similarly, you can cast an array from/into a float or string.

In [37]:
years = np.array(['2000','1999','2003','1998','1999','2004'])
years.astype(int)

array([2000, 1999, 2003, 1998, 1999, 2004])
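
The sketch below, using made-up values, contrasts truncation by `astype(int)` with rounding by `np.rint`, and shows a round trip through strings. Note that `np.rint` rounds halves to the nearest even integer.

```python
import numpy as np

vals = np.array([1.45, 2.36, -4.65, 5.5])  # made-up values for illustration

# astype(int) truncates toward zero
truncated = vals.astype(int)

# np.rint rounds to the nearest integer (halves go to the even neighbor), still as floats
rounded = np.rint(vals).astype(int)

# Casting to strings and back to floats
as_text = vals.astype(str)
back_to_float = as_text.astype(float)

print(truncated)  # [ 1  2 -4  5]
print(rounded)    # [ 1  2 -5  6]
```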

### Indexing and slicing

For 1D arrays, slicing and indexing are similar to corresponding operations on lists.

In [38]:
sales = np.array([20, 30, 31, 33, 33, 35, 40, 410, 410, 45])
sales[7:9]

array([410, 410])

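However, unlike a list slice, a slice of an ndarray is a view of the original data rather than a copy. If a scalar value is assigned to a slice, the value is broadcast across the slice and the change shows up in the original array: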

In [40]:
sales_slice = sales[7:9]
sales_slice[:] = 41
sales

array([20, 30, 31, 33, 33, 35, 40, 41, 41, 45])
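
If you want to modify a slice without touching the original array, one option is to take an explicit copy. The sketch below contrasts a view, a copy, and the behavior of a plain Python list; the list values are chosen only for illustration.

```python
import numpy as np

sales = np.array([20, 30, 31, 33, 33, 35, 40, 410, 410, 45])

# An explicit copy owns its own data, so editing it leaves sales untouched
sales_copy = sales[7:9].copy()
sales_copy[:] = 41

# A plain slice shares memory with the original array
view_shares = np.shares_memory(sales, sales[7:9])
copy_shares = np.shares_memory(sales, sales_copy)

# List slices are always copies
lst = [20, 30, 410, 410]
lst_slice = lst[2:4]
lst_slice[:] = [41, 41]

print(sales[7:9])                # [410 410]
print(view_shares, copy_shares)  # True False
print(lst)                       # [20, 30, 410, 410]
```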

What if we have a 2D array?

In [41]:
my_arr

array([[  1.45,   3.46],
       [  2.36,   6.43],
       [  4.65,  10.24],
       [  5.43,   3.25]])

In [42]:
my_arr[0,1]

3.46

In [43]:
my_arr[:2, 0]

array([ 1.45,  2.36])

For a 2D array, the concept of axis0 vs. axis1 is something you'll want to be familiar with, especially once we move into pandas. (See this Stack Overflow discussion if you're unsure: https://stackoverflow.com/questions/22149584/what-does-axis-in-pandas-mean)
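
As a quick illustration of the two axes, the sketch below rebuilds the same values as `my_arr` inline (so it runs without the text file) and sums along each axis. Reducing along axis 0 collapses the rows and gives one result per column; reducing along axis 1 collapses the columns and gives one result per row.

```python
import numpy as np

# Same values as my_arr above, entered by hand so the example is self-contained
my_arr = np.array([[1.45, 3.46],
                   [2.36, 6.43],
                   [4.65, 10.24],
                   [5.43, 3.25]])

col_sums = my_arr.sum(axis=0)  # one value per column, shape (2,)
row_sums = my_arr.sum(axis=1)  # one value per row, shape (4,)

print(col_sums.shape, row_sums.shape)  # (2,) (4,)
```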

### Mathematical and statistical operations, functions, and methods

Remember, these operations are applied *elementwise*.

In [44]:
arr1 = np.arange(10)
arr2 = arr1 + 2

In [45]:
arr1

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [46]:
arr2

array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [70]:
arr1 + arr2

array([ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20])

In [73]:
arr2 * arr2

array([  4,   9,  16,  25,  36,  49,  64,  81, 100, 121])

In [74]:
arr2 ** 0.5

array([ 1.41421356,  1.73205081,  2.        ,  2.23606798,  2.44948974,
        2.64575131,  2.82842712,  3.        ,  3.16227766,  3.31662479])

In [75]:
1/arr2

array([ 0.5       ,  0.33333333,  0.25      ,  0.2       ,  0.16666667,
        0.14285714,  0.125     ,  0.11111111,  0.1       ,  0.09090909])

**Universal functions, or ufuncs:**

In [47]:
# Equivalent to arr2 * arr2

np.square(arr2)

array([  4,   9,  16,  25,  36,  49,  64,  81, 100, 121], dtype=int32)

In [48]:
# Equivalent to arr2 ** 0.5

np.sqrt(arr2)

array([ 1.41421356,  1.73205081,  2.        ,  2.23606798,  2.44948974,
        2.64575131,  2.82842712,  3.        ,  3.16227766,  3.31662479])

In [55]:
# Absolute values

mixed_arr = np.array([-1,3,2,-3,5,4,3,-6,-4,5])
np.abs(mixed_arr)

array([1, 3, 2, 3, 5, 4, 3, 6, 4, 5])

In [56]:
# Signs

np.sign(mixed_arr)

array([-1,  1,  1, -1,  1,  1,  1, -1, -1,  1])

In [57]:
# Comparing two arrays to get elementwise maxima

zeros_arr = np.zeros(10)
np.maximum(mixed_arr, zeros_arr)

array([ 0.,  3.,  2.,  0.,  5.,  4.,  3.,  0.,  0.,  5.])

This is only a small sample of the available ufuncs. For more, check out the documentation: https://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs
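
Binary ufuncs also accept a scalar in place of the second array, which is broadcast against every element. The sketch below repeats the elementwise-maximum example with a scalar `0` instead of an array of zeros; one side effect is that the result stays an integer array.

```python
import numpy as np

mixed_arr = np.array([-1, 3, 2, -3, 5, 4, 3, -6, -4, 5])

# The scalar 0 is broadcast against every element of mixed_arr
clipped = np.maximum(mixed_arr, 0)

print(clipped)  # [0 3 2 0 5 4 3 0 0 5]
```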

### Array methods

In [58]:
my_arr

array([[  1.45,   3.46],
       [  2.36,   6.43],
       [  4.65,  10.24],
       [  5.43,   3.25]])

In [59]:
# Sum of all elements

my_arr.sum()

37.269999999999996

In [60]:
# Arithmetic mean

my_arr.mean()

4.6587499999999995

In [61]:
# Standard deviation (optionally, adjust degrees of freedom used in calculation via ddof parameter)

my_arr.std()

2.5952959248417127

In [62]:
# Variance (ddof adjustable)

my_arr.var()

6.7355609374999998
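
The `ddof` parameter mentioned above changes the divisor from N to N - ddof. The sketch below, which re-enters the values of `my_arr` by hand, compares the default (population) standard deviation with the sample version obtained with `ddof=1`.

```python
import numpy as np

# Same values as my_arr above, entered by hand so the example is self-contained
my_arr = np.array([[1.45, 3.46],
                   [2.36, 6.43],
                   [4.65, 10.24],
                   [5.43, 3.25]])

pop_std = my_arr.std()           # divides by N (default, ddof=0)
sample_std = my_arr.std(ddof=1)  # divides by N - 1

# The two differ by a factor of sqrt(N / (N - 1))
n = my_arr.size
ratio_ok = np.isclose(sample_std, pop_std * np.sqrt(n / (n - 1)))

print(pop_std, sample_std, ratio_ok)
```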

In [63]:
# Maximum of all elements

my_arr.max()

10.24

In [64]:
# Minimum of all elements

my_arr.min()

1.45

In [65]:
# What if I want the maximum value in each row?

my_arr.max(axis=1)

array([  3.46,   6.43,  10.24,   5.43])

In [66]:
# Finding the index of the maximum element (the position in the flattened array)

my_arr.argmax()

5

Okay...

In [67]:
# For arrays with more than one dimension, "unravel" the flat index to get an informative result

np.unravel_index(my_arr.argmax(), my_arr.shape)

(2, 1)
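
`argmax` also accepts an `axis` argument, in the same way `max` does. The sketch below, again with the values of `my_arr` entered by hand, shows the per-column and per-row positions of the maxima next to the unraveled flat index.

```python
import numpy as np

# Same values as my_arr above, entered by hand so the example is self-contained
my_arr = np.array([[1.45, 3.46],
                   [2.36, 6.43],
                   [4.65, 10.24],
                   [5.43, 3.25]])

row_of_col_max = my_arr.argmax(axis=0)  # for each column, the row holding its maximum
col_of_row_max = my_arr.argmax(axis=1)  # for each row, the column holding its maximum
flat_pos = np.unravel_index(my_arr.argmax(), my_arr.shape)

print(row_of_col_max)  # [3 2]
print(col_of_row_max)  # [1 1 1 0]
```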

For more, check out the "Methods" subsection here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html

### Masking and more with booleans (True/False values)

In [69]:
# Creating an array of the top 40 U.S. cities by population

top40_arr = np.array(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Philadelphia', 'Phoenix', 'San Antonio', 'San Diego',
         'Dallas', 'San Jose', 'Austin', 'Jacksonville', 'San Francisco', 'Indianapolis', 'Columbus', 'Fort Worth',
         'Charlotte', 'Seattle', 'Denver', 'El Paso', 'Detroit', 'Washington', 'Boston', 'Memphis', 'Nashville', 'Portland',
         'Oklahoma City', 'Las Vegas', 'Baltimore', 'Louisville', 'Milwaukee', 'Albuquerque', 'Tucson', 'Fresno', 'Sacramento',
         'Kansas City', 'Long Beach', 'Mesa', 'Atlanta', 'Colorado Springs'])

In [70]:
# Which cities start with a letter in the second half of the alphabet?

top40_arr >= 'N'

array([ True, False, False, False,  True,  True,  True,  True, False,
        True, False, False,  True, False, False, False, False,  True,
       False, False, False,  True, False, False,  True,  True,  True,
       False, False, False, False, False,  True, False,  True, False,
       False, False, False, False], dtype=bool)

In [71]:
# Creating a second array of the same length

new_arr = np.random.randn(40,2)
new_arr

array([[ 0.26432544,  0.16208614],
       [ 0.08823217, -0.80434646],
       [ 0.22107264,  0.8952034 ],
       [ 0.15990068,  1.62532846],
       [ 1.03174265, -0.27502067],
       [-0.83461485,  2.23119979],
       [-0.5165646 ,  0.95951543],
       [ 0.99477588, -0.97111966],
       [-2.21849996, -0.44062606],
       [ 0.22448905,  0.11457922],
       [-0.69462664,  0.0347786 ],
       [ 0.30274307, -1.07990909],
       [ 0.36468191, -0.83279376],
       [ 0.30223016,  1.08173924],
       [ 0.05219601,  1.5957482 ],
       [ 0.46218633, -0.65647843],
       [-1.79786734, -0.4447802 ],
       [-2.81981526, -0.04427499],
       [ 1.92298891,  0.57108383],
       [ 0.63328914,  1.2484413 ],
       [-0.23548085,  1.14898168],
       [-0.91157772, -0.37185659],
       [ 1.24633845,  0.48027028],
       [-0.92060679, -0.32799263],
       [-1.01079219, -0.28695451],
       [-0.39518355,  1.09540998],
       [ 0.40780184, -0.28839561],
       [ 0.48487683, -2.30801649],
       [-0.99194341,

In [72]:
# Using the boolean array as a mask for a second array

mask = top40_arr>='N'
new_arr[mask]

array([[ 0.26432544,  0.16208614],
       [ 1.03174265, -0.27502067],
       [-0.83461485,  2.23119979],
       [-0.5165646 ,  0.95951543],
       [ 0.99477588, -0.97111966],
       [ 0.22448905,  0.11457922],
       [ 0.36468191, -0.83279376],
       [-2.81981526, -0.04427499],
       [-0.91157772, -0.37185659],
       [-1.01079219, -0.28695451],
       [-0.39518355,  1.09540998],
       [ 0.40780184, -0.28839561],
       [ 0.67233567, -0.83247363],
       [-0.05106747, -0.8113725 ]])

In [73]:
# Compound masks also work

mask2 = (top40_arr>='N')&(top40_arr<='R')
new_arr[mask2]

array([[ 0.26432544,  0.16208614],
       [ 1.03174265, -0.27502067],
       [-0.83461485,  2.23119979],
       [-1.01079219, -0.28695451],
       [-0.39518355,  1.09540998],
       [ 0.40780184, -0.28839561]])

In [74]:
# Checking if any top-40 city name sorts after "X" (that is, starts with X, Y, or Z)

maskP = top40_arr>'X'
maskP.any()

False

In [75]:
# Checking if all top-40 cities in the array are in title case

maskCapit = np.char.istitle(top40_arr)    # np.char functions operate on ordinary string arrays
maskCapit.all()

True
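
A few other mask idioms are sketched below on the first six cities from the array above (a shortened array, used only to keep the example small): counting matches, negating a mask with `~`, combining masks with `|`, and recovering the matching positions with `np.where`.

```python
import numpy as np

# First six entries of top40_arr, to keep the example short
cities = np.array(['New York', 'Los Angeles', 'Chicago',
                   'Houston', 'Philadelphia', 'Phoenix'])

mask = cities >= 'N'

n_matches = mask.sum()            # True counts as 1, so the sum is the number of matches
first_half = cities[~mask]        # ~ negates the mask
either = cities[(cities < 'D') | (cities >= 'P')]   # | is elementwise "or"
positions = np.where(mask)[0]     # integer positions where the mask is True

print(n_matches)   # 3
print(first_half)  # ['Los Angeles' 'Chicago' 'Houston']
print(either)      # ['Chicago' 'Philadelphia' 'Phoenix']
print(positions)   # [0 4 5]
```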

### Some matrix operations (just the tip of the iceberg)

In [76]:
# Creating a matrix using an ndarray object

mat = np.array([[2,5],
                [6,7]])

In [77]:
# Transposing the matrix

mat.T

array([[2, 6],
       [5, 7]])

In [78]:
# Calculating X'X where X' is the transpose of X

matT = mat.T
matT.dot(mat)    # you can also use "np.dot(matT,mat)"

array([[40, 52],
       [52, 74]])

In [79]:
# 2x2 identity matrix

np.identity(2)

array([[ 1.,  0.],
       [ 0.,  1.]])

In [80]:
# Finding the inverse of the matrix

np.linalg.inv(mat)

array([[-0.4375,  0.3125],
       [ 0.375 , -0.125 ]])

In [81]:
# Finding the determinant of the matrix

np.linalg.det(mat)

-15.999999999999998
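
The results above can be sanity-checked with a short sketch: multiplying the matrix by its inverse should give the identity matrix up to floating-point error, which is why `np.allclose` and `np.isclose` are used instead of exact equality. The `@` operator shown here is an alternative spelling of `dot` for 2-D arrays.

```python
import numpy as np

mat = np.array([[2, 5],
                [6, 7]])

# @ performs matrix multiplication, like mat.T.dot(mat)
gram = mat.T @ mat

# A matrix times its inverse should be (approximately) the identity
mat_inv = np.linalg.inv(mat)
is_identity = np.allclose(mat @ mat_inv, np.identity(2))

# The exact determinant is 2*7 - 5*6 = -16; the computed value is only close to it
det_close = np.isclose(np.linalg.det(mat), -16)

print(gram)
print(is_identity, det_close)  # True True
```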

Other linear algebra methods can be found here: https://docs.scipy.org/doc/numpy/reference/routines.linalg.html

**Exercise 1:**

The Iris dataset is a well-known data source for teaching machine learning classification algorithms. There are four non-class attributes: sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm). Each row corresponds to measurements from one iris plant.
Using the provided array, calculate the minimum, maximum, mean, and standard deviation for each attribute (you can assume the order of attributes above reflects the order of the columns in the array). Create a new array that excludes outliers (for this exercise, flowers with a value more than 2 standard deviations away from the mean for any of the attributes).

*Hint*: The axis can be specified for the arr.any() and arr.all() methods.
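
As a small illustration of the hint, the sketch below applies `all` and `any` along each axis of a made-up boolean array (unrelated to the Iris data).

```python
import numpy as np

# Made-up boolean array: 3 rows, 2 columns
flags = np.array([[True, True],
                  [True, False],
                  [False, False]])

all_per_row = flags.all(axis=1)  # True only where every value in the row is True
any_per_row = flags.any(axis=1)  # True where at least one value in the row is True
all_per_col = flags.all(axis=0)  # same idea, one result per column

print(all_per_row)  # [ True False False]
print(any_per_row)  # [ True  True False]
print(all_per_col)  # [False False]
```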

In [82]:
from sklearn import datasets

iris_data = datasets.load_iris()
iris_arr = iris_data.data

**Answer 1:**

In [83]:
## ENTER CODE HERE

iris_arr.shape    # how many rows are we working with?

(150, 4)

In [84]:
iris_arr.min(axis=0)    # minimum values

array([ 4.3,  2. ,  1. ,  0.1])

In [85]:
iris_arr.max(axis=0)    # maximum values

array([ 7.9,  4.4,  6.9,  2.5])

In [86]:
mean_vals = iris_arr.mean(axis=0)    # mean values
mean_vals

array([ 5.84333333,  3.054     ,  3.75866667,  1.19866667])

In [87]:
std_vals = iris_arr.std(axis=0)    # standard deviations
std_vals

array([ 0.82530129,  0.43214658,  1.75852918,  0.76061262])

In [88]:
# To create a mask against outliers, first get boolean array differentiating outliers vs. non-outliers

notOutlier = (iris_arr <= mean_vals+2*std_vals)&(iris_arr >= mean_vals-2*std_vals)
notOutlier

array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True, False,  True,  True],
       [ True, False,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
 

In [89]:
# To use the mask on rows, reduce it to 1D: one boolean per row, True only if every attribute is within bounds

compressed = notOutlier.all(axis=1)
compressed

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False,  True,  True,  True, False,  True,  True,  True,
        True,  True,

In [90]:
# Finally, use the mask to exclude rows with outliers

iris_arr[compressed]
#iris_arr[compressed].shape

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2],
       [ 5.4,  3.9,  1.7,  0.4],
       [ 4.6,  3.4,  1.4,  0.3],
       [ 5. ,  3.4,  1.5,  0.2],
       [ 4.4,  2.9,  1.4,  0.2],
       [ 4.9,  3.1,  1.5,  0.1],
       [ 5.4,  3.7,  1.5,  0.2],
       [ 4.8,  3.4,  1.6,  0.2],
       [ 4.8,  3. ,  1.4,  0.1],
       [ 4.3,  3. ,  1.1,  0.1],
       [ 5.4,  3.9,  1.3,  0.4],
       [ 5.1,  3.5,  1.4,  0.3],
       [ 5.7,  3.8,  1.7,  0.3],
       [ 5.1,  3.8,  1.5,  0.3],
       [ 5.4,  3.4,  1.7,  0.2],
       [ 5.1,  3.7,  1.5,  0.4],
       [ 4.6,  3.6,  1. ,  0.2],
       [ 5.1,  3.3,  1.7,  0.5],
       [ 4.8,  3.4,  1.9,  0.2],
       [ 5. ,  3. ,  1.6,  0.2],
       [ 5. ,  3.4,  1.6,  0.4],
       [ 5.2,  3.5,  1.5,  0.2],
       [ 5.2,  3.4,  1.4,  0.2],
       [ 4.7,  3.2,  1.6,  0.2],
       [ 4.8,  3.1,  1.6,  0.2],
       [ 5.4,  3.4,  1.5,  0.4],
       [ 4

*Reference*:

The following material was consulted during development of this notebook, which loosely follows the structure of McKinney's chapter on NumPy Basics:

Wes McKinney. "Chapter 4 - NumPy Basics: Arrays and Vectorized Computation." *Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython.* O'Reilly Media, 2012. EBSCOhost, login.proxy.libraries.rutgers.edu/login?url=https://search-ebscohost-com.proxy.libraries.rutgers.edu/login.aspx?direct=true&db=nlebk&AN=495822&site=eds-live.