## Getting started with Jupyter Notebooks

Before we begin, let's cover a few basics about Jupyter (formerly known as IPython) notebooks. This is a Markdown cell. Select a cell (a blue boundary should appear around it, indicating command mode) and press Enter to switch to edit mode (the cell boundary turns green). To switch back to command mode, press Esc. To run the cell and move to the next one, press Shift+Enter. To run the cell without advancing, press Ctrl+Enter.

To create a new cell below the current one, switch to command mode and press "b". Pressing "a" creates a new cell above the current cell, and pressing "d" twice ("dd") deletes the current cell.

## Introduction

In this workshop, we'll explore the NumPy and pandas libraries for data analysis. NumPy is a scientific computing library often used for its fast array-based operations, and pandas is a data analysis library commonly used for manipulation/analysis of tabular (e.g. CSV-formatted) data.

## NumPy Arrays (ndarrays)

In [1]:
## CODE CELL 1

import numpy as np

In [2]:
## CODE CELL 2

arr = np.array([1,3,5,7])
arr

array([1, 3, 5, 7])

Why ndarrays?
- efficient vectorized, elementwise operations for homogeneous data (sometimes orders of magnitude faster than in "pure Python")
- provides foundation for operations in **pandas**
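
As a rough illustration of the speed claim above, the sketch below times squaring 100,000 integers with a list comprehension versus a vectorized array operation. The helper names (`pure_python_square`, `numpy_square`) and the array size are made up for this example, and the actual timings will vary by machine.

```python
import timeit

import numpy as np

n = 100_000
values = list(range(n))                # plain Python list
arr = np.arange(n, dtype=np.int64)     # equivalent ndarray (int64 avoids overflow when squaring)

def pure_python_square(vals):
    """Square each element with an explicit Python-level loop."""
    return [v * v for v in vals]

def numpy_square(a):
    """Square each element with one vectorized operation."""
    return a * a

# Run each version 20 times and report the total time
loop_time = timeit.timeit(lambda: pure_python_square(values), number=20)
vec_time = timeit.timeit(lambda: numpy_square(arr), number=20)
print(f"pure Python: {loop_time:.4f} s, NumPy: {vec_time:.4f} s")

# Both approaches should produce the same numbers
same = pure_python_square(values) == numpy_square(arr).tolist()
```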

### Creating an ndarray

In [3]:
## CODE CELL 3
# We already saw this

np.array([1,3,5,7])

array([1, 3, 5, 7])

In [4]:
## CODE CELL 4
# Similar to range() function used in for loops and other places

np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [5]:
## CODE CELL 5
# Random numbers sampled from Gaussian distribution, mean = 0 and variance = 1
# Using different mean/variance: https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.randn.html

np.random.randn(5)

array([-0.6742499 ,  0.59940781, -1.57837259,  0.92996841,  0.40355725])

In [6]:
## CODE CELL 6
# 2-D version

np.random.randn(5,2)

array([[-1.26295801,  0.02332286],
       [-0.67562393, -0.08328262],
       [ 1.95087345,  0.61237214],
       [-0.73844885,  0.73327816],
       [ 0.45135441, -0.89444135]])
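
The comment in code cell 5 mentions using a different mean and variance. One common way to do this, sketched below, is to scale and shift the standard normal samples: multiply by the desired standard deviation and add the desired mean. The values of `mu` and `sigma` are arbitrary choices for this example.

```python
import numpy as np

mu, sigma = 10.0, 2.5    # hypothetical target mean and standard deviation

# randn draws from N(0, 1); scaling by sigma and shifting by mu gives N(mu, sigma**2)
samples = mu + sigma * np.random.randn(100_000)

# With this many samples, the empirical values should land close to mu and sigma
print(samples.mean(), samples.std())
```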

In [7]:
## CODE CELL 7
# All ones

np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [8]:
## CODE CELL 8
# All zeros

np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [9]:
## CODE CELL 9
# From a text file

my_arr = np.loadtxt('loadarray.txt')
my_arr

array([[ 1.45,  3.46],
       [ 2.36,  6.43],
       [ 4.65, 10.24],
       [ 5.43,  3.25]])

### Investigating an ndarray

In [10]:
## CODE CELL 10
# Number of dimensions

my_arr.ndim

2

In [11]:
## CODE CELL 11
# Shape

my_arr.shape

(4, 2)

In [12]:
## CODE CELL 12
# Data type

my_arr.dtype

dtype('float64')

### Casting from one dtype to another

In [13]:
## CODE CELL 13
# Let's get a copy of my_arr as an integer array

my_intarr = my_arr.astype(int)
my_intarr
#rounded_arr = np.rint(my_arr)    # numbers rounded instead of truncated (uncomment this and the next line to run code)
#rounded_arr

array([[ 1,  3],
       [ 2,  6],
       [ 4, 10],
       [ 5,  3]])

Similarly, you can cast an array from/into a float or string.

In [14]:
## CODE CELL 14

years = np.array(['2000','1999','2003','1998','1999','2004'])
years.astype(int)

array([2000, 1999, 2003, 1998, 1999, 2004])
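
For completeness, here is a small sketch of casting in the other directions, from integers into floats and into strings. The `counts` array is made up for this example.

```python
import numpy as np

counts = np.array([3, 14, 15])       # hypothetical integer data

as_float = counts.astype(float)      # integers -> floats
as_str = counts.astype(str)          # integers -> strings

print(as_float)
print(as_str)
```

Note that `astype` always returns a new array; `counts` itself keeps its integer dtype.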

### Indexing and slicing

For 1D arrays, slicing and indexing are similar to corresponding operations on lists.

In [15]:
## CODE CELL 15

sales = np.array([20, 30, 31, 33, 33, 35, 40, 410, 410, 45])
sales[7:9]

array([410, 410])

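However, unlike a list slice, an array slice is a view on the original data: if a scalar value is assigned to the slice, the value is broadcast across it and the change shows up in the original array.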

In [16]:
## CODE CELL 16

sales_slice = sales[7:9]
sales_slice[:] = 41
sales

array([20, 30, 31, 33, 33, 35, 40, 41, 41, 45])
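
If you want to modify a slice without touching the original array, one option is to take an explicit copy first. The sketch below contrasts the two behaviors on the same `sales` data; the names `safe_slice` and `view_slice` are introduced only for this example.

```python
import numpy as np

sales = np.array([20, 30, 31, 33, 33, 35, 40, 410, 410, 45])

# An explicit copy is independent of the original array
safe_slice = sales[7:9].copy()
safe_slice[:] = 41
after_copy_edit = sales.tolist()      # sales still contains 410, 410

# A plain slice is a view: it shares memory with the original array
view_slice = sales[7:9]
shares = np.shares_memory(view_slice, sales)
view_slice[:] = 41
after_view_edit = sales.tolist()      # sales now contains 41, 41

print(after_copy_edit)
print(after_view_edit)
```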

What if we have a 2D array?

In [17]:
## CODE CELL 17

my_arr

array([[ 1.45,  3.46],
       [ 2.36,  6.43],
       [ 4.65, 10.24],
       [ 5.43,  3.25]])

In [18]:
## CODE CELL 18

my_arr[0,1]

3.46

In [19]:
## CODE CELL 19

my_arr[:2, 0]

array([1.45, 2.36])

For a 2D array, the concept of axis 0 (running down the rows) vs. axis 1 (running across the columns) is something you'll want to be familiar with, especially once we move into pandas. (See this Stack Overflow discussion for some helpful perspectives: https://stackoverflow.com/questions/22149584/what-does-axis-in-pandas-mean)
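
A quick way to build intuition for the two axes is to sum a small array along each one. In the sketch below (the `grid` array is made up for this example), `axis=0` collapses the rows and leaves one value per column, while `axis=1` collapses the columns and leaves one value per row.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])     # shape (2, 3): 2 rows, 3 columns

col_sums = grid.sum(axis=0)      # one sum per column -> array([5, 7, 9])
row_sums = grid.sum(axis=1)      # one sum per row    -> array([ 6, 15])

print(col_sums)
print(row_sums)
```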

### Mathematical and statistical operations, functions, and methods

Remember, these operations are applied *elementwise*.

In [20]:
## CODE CELL 20

arr1 = np.arange(10)
arr2 = arr1 + 2

In [21]:
## CODE CELL 21

arr1

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [22]:
## CODE CELL 22

arr2

array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [23]:
## CODE CELL 23

arr1 + arr2

array([ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20])

In [24]:
## CODE CELL 24

arr2 * arr2

array([  4,   9,  16,  25,  36,  49,  64,  81, 100, 121])

In [25]:
## CODE CELL 25

arr2 ** 0.5

array([1.41421356, 1.73205081, 2.        , 2.23606798, 2.44948974,
       2.64575131, 2.82842712, 3.        , 3.16227766, 3.31662479])

In [26]:
## CODE CELL 26

1/arr2

array([0.5       , 0.33333333, 0.25      , 0.2       , 0.16666667,
       0.14285714, 0.125     , 0.11111111, 0.1       , 0.09090909])

**Universal functions, or ufuncs:**

In [27]:
## CODE CELL 27
# Equivalent to arr2 * arr2

np.square(arr2)

array([  4,   9,  16,  25,  36,  49,  64,  81, 100, 121])

In [28]:
## CODE CELL 28
# Equivalent to arr2 ** 0.5

np.sqrt(arr2)

array([1.41421356, 1.73205081, 2.        , 2.23606798, 2.44948974,
       2.64575131, 2.82842712, 3.        , 3.16227766, 3.31662479])

In [29]:
## CODE CELL 29
# Absolute values

mixed_arr = np.array([-1,3,2,-3,5,4,3,-6,-4,5])
np.abs(mixed_arr)

array([1, 3, 2, 3, 5, 4, 3, 6, 4, 5])

In [30]:
## CODE CELL 30
# Signs

np.sign(mixed_arr)

array([-1,  1,  1, -1,  1,  1,  1, -1, -1,  1])

In [31]:
## CODE CELL 31
# Comparing two arrays to get elementwise maxima

zeros_arr = np.zeros(10)
np.maximum(mixed_arr, zeros_arr)

array([0., 3., 2., 0., 5., 4., 3., 0., 0., 5.])

This is only a small sample of the available ufuncs. For more, check out the documentation: https://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs

### Array methods

In [32]:
## CODE CELL 32

my_arr

array([[ 1.45,  3.46],
       [ 2.36,  6.43],
       [ 4.65, 10.24],
       [ 5.43,  3.25]])

In [33]:
## CODE CELL 33
# Sum of all elements

my_arr.sum()

37.269999999999996

In [34]:
## CODE CELL 34
# Arithmetic mean

my_arr.mean()

4.6587499999999995

In [35]:
## CODE CELL 35
# Standard deviation (optionally, adjust degrees of freedom used in calculation via ddof parameter)

my_arr.std()

2.5952959248417127

In [36]:
## CODE CELL 36
# Variance (ddof adjustable)

my_arr.var()

6.7355609375
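
The `ddof` parameter mentioned in the comments above changes the divisor used in the calculation from `N` to `N - ddof`. The sketch below, using a made-up `data` array, compares the default population statistics (`ddof=0`) with the sample statistics (`ddof=1`).

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # hypothetical data, mean = 5

pop_std = data.std()             # divides by N (default, ddof=0)
sample_std = data.std(ddof=1)    # divides by N - 1

pop_var = data.var()             # squared deviations sum to 32, so 32 / 8
sample_var = data.var(ddof=1)    # 32 / 7

print(pop_std, sample_std)
print(pop_var, sample_var)
```

The sample versions are always at least as large as the population versions, since they divide the same sum of squares by a smaller number.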

In [37]:
## CODE CELL 37
# Maximum of all elements

my_arr.max()

10.24

In [38]:
## CODE CELL 38
# Minimum of all elements

my_arr.min()

1.45

In [39]:
## CODE CELL 39
# What if I want the maximum value in each row?

my_arr.max(axis=1)

array([ 3.46,  6.43, 10.24,  5.43])

In [40]:
## CODE CELL 40
# Finding the index of the maximum element of the array (as a position in the flattened array)

my_arr.argmax()

5

How do we interpret this result?

In [41]:
## CODE CELL 41
# for > 1D, need to "unravel" the index to get a more informative result

np.unravel_index(my_arr.argmax(), my_arr.shape)

(2, 1)
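
To make the connection between the flat index and the (row, column) pair concrete, the sketch below rebuilds the same values as `my_arr` by hand (so it does not depend on `loadarray.txt`) and checks that both forms point at the same element. It also shows `argmax` with an `axis` argument, which returns one index per row.

```python
import numpy as np

# Same values as my_arr above, typed in directly for a self-contained example
my_arr = np.array([[1.45, 3.46],
                   [2.36, 6.43],
                   [4.65, 10.24],
                   [5.43, 3.25]])

flat_index = my_arr.argmax()                            # position in the flattened array
row, col = np.unravel_index(flat_index, my_arr.shape)   # equivalent (row, column) pair

# Both ways of indexing reach the same maximum value
same_element = my_arr.ravel()[flat_index] == my_arr[row, col]

# Column index of the maximum within each row
row_argmax = my_arr.argmax(axis=1)

print(flat_index, (row, col), row_argmax)
```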

For more, check out the "Methods" subsection here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html

### Masking and more with booleans (True/False values)

In [42]:
## CODE CELL 42
# Creating an array of the top 40 U.S. cities by population

top40_arr = np.array(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Philadelphia', 'Phoenix', 'San Antonio', 'San Diego',
         'Dallas', 'San Jose', 'Austin', 'Jacksonville', 'San Francisco', 'Indianapolis', 'Columbus', 'Fort Worth',
         'Charlotte', 'Seattle', 'Denver', 'El Paso', 'Detroit', 'Washington', 'Boston', 'Memphis', 'Nashville', 'Portland',
         'Oklahoma City', 'Las Vegas', 'Baltimore', 'Louisville', 'Milwaukee', 'Albuquerque', 'Tucson', 'Fresno', 'Sacramento',
         'Kansas City', 'Long Beach', 'Mesa', 'Atlanta', 'Colorado Springs'])

In [43]:
## CODE CELL 43
# Which cities start with a letter in the second half of the alphabet?

top40_arr >= 'N'

array([ True, False, False, False,  True,  True,  True,  True, False,
        True, False, False,  True, False, False, False, False,  True,
       False, False, False,  True, False, False,  True,  True,  True,
       False, False, False, False, False,  True, False,  True, False,
       False, False, False, False])

In [44]:
## CODE CELL 44
# Creating a second array of the same length

new_arr = np.random.randn(40,2)
new_arr

array([[ 0.12103525,  0.16131923],
       [-0.35259555, -0.55279145],
       [-0.75223008, -1.93770704],
       [-0.13241254, -0.08311529],
       [-0.67964531,  1.17099578],
       [-2.10239515, -0.64597733],
       [ 1.0280561 , -0.00695737],
       [-0.62142915, -0.61024129],
       [-1.35816256,  0.40074095],
       [ 0.28900143,  0.52395531],
       [ 2.18617459,  0.71778324],
       [ 0.16654649,  1.11357703],
       [-0.54194962,  1.56947609],
       [ 0.06470212, -1.20748622],
       [-1.56185734,  0.5121815 ],
       [-1.05978528,  0.02042953],
       [ 0.78191041,  0.64630957],
       [ 1.77675705, -0.31605832],
       [ 1.83028327, -1.61548063],
       [-0.21741759, -0.53610538],
       [-1.75744395, -0.12900563],
       [-0.53690436,  0.64259293],
       [-0.16983497,  0.25127686],
       [-0.32393802,  1.10888898],
       [-1.42580266,  1.64214011],
       [ 0.89395588, -0.62761174],
       [-1.07535551,  0.37065945],
       [-0.22811201,  1.33900666],
       [ 0.78405152,

In [45]:
## CODE CELL 45
# Using the boolean array as a mask for a second array

mask = top40_arr>='N'
new_arr[mask]

array([[ 0.12103525,  0.16131923],
       [-0.67964531,  1.17099578],
       [-2.10239515, -0.64597733],
       [ 1.0280561 , -0.00695737],
       [-0.62142915, -0.61024129],
       [ 0.28900143,  0.52395531],
       [-0.54194962,  1.56947609],
       [ 1.77675705, -0.31605832],
       [-0.53690436,  0.64259293],
       [-1.42580266,  1.64214011],
       [ 0.89395588, -0.62761174],
       [-1.07535551,  0.37065945],
       [ 1.66976037, -0.6652307 ],
       [-2.67074776, -0.8239936 ]])

In [46]:
## CODE CELL 46
# Compound masks also work

mask2 = (top40_arr>='N')&(top40_arr<='R')
new_arr[mask2]

array([[ 0.12103525,  0.16131923],
       [-0.67964531,  1.17099578],
       [-2.10239515, -0.64597733],
       [-1.42580266,  1.64214011],
       [ 0.89395588, -0.62761174],
       [-1.07535551,  0.37065945]])
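
Masks can also be applied back to the array they were built from, negated with `~`, and combined with `|` (or) as well as `&` (and). The sketch below uses a shortened, made-up list of city names so the results are easy to check by eye.

```python
import numpy as np

cities = np.array(['New York', 'Chicago', 'Phoenix', 'Seattle', 'Austin'])

late_alphabet = cities >= 'N'

# Apply the mask to the array it came from to see which names matched
late_names = cities[late_alphabet]

# ~ flips every True/False value in the mask
early_names = cities[~late_alphabet]

# | keeps an element if either condition holds (parentheses are required)
edge_names = cities[(cities < 'B') | (cities > 'S')]

# True counts as 1, so summing a mask counts the matches
n_late = late_alphabet.sum()

print(late_names, early_names, edge_names, n_late)
```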

In [47]:
## CODE CELL 47
# Checking if any top-40 city name sorts after "X" (i.e., starts with "X", "Y", or "Z")

maskP = top40_arr>'X'
maskP.any()

False

In [48]:
## CODE CELL 48
# Checking if all top-40 cities in the array are in title case

maskCapit = np.char.istitle(top40_arr)
maskCapit.all()

True

### Some matrix operations (just the tip of the iceberg)

In [49]:
## CODE CELL 49
# Creating a matrix using an ndarray object

mat = np.array([[2,5],
                [6,7]])

In [50]:
## CODE CELL 50
# Transposing the matrix

mat.T

array([[2, 6],
       [5, 7]])

In [51]:
## CODE CELL 51
# Calculating X'X where X' is the transpose of X

matT = mat.T
matT.dot(mat)    # you can also use "np.dot(matT,mat)"

array([[40, 52],
       [52, 74]])

In [52]:
## CODE CELL 52
# 2x2 identity matrix

np.identity(2)

array([[1., 0.],
       [0., 1.]])

In [53]:
## CODE CELL 53
# Finding the inverse of the matrix

np.linalg.inv(mat)

array([[-0.4375,  0.3125],
       [ 0.375 , -0.125 ]])

In [54]:
## CODE CELL 54
# Finding the determinant of the matrix

np.linalg.det(mat)

-15.999999999999998
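
The determinant above prints as -15.999999999999998 rather than -16 because of floating-point rounding. When checking results like these, it is usually safer to compare with a tolerance than with `==`. The sketch below verifies the inverse by multiplying it with the original matrix, using the `@` operator as an alternative to `.dot()`.

```python
import numpy as np

mat = np.array([[2, 5],
                [6, 7]])

inv = np.linalg.inv(mat)

# A matrix times its inverse should give the identity matrix (up to rounding)
product = mat @ inv
is_identity = np.allclose(product, np.identity(2))

# Compare the determinant to its exact value, 2*7 - 5*6 = -16, with a tolerance
det_matches = np.isclose(np.linalg.det(mat), -16)

print(is_identity, det_matches)
```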

Other linear algebra methods can be found here: https://docs.scipy.org/doc/numpy/reference/routines.linalg.html

**Exercise 1:**

The Iris dataset is a well-known data source for teaching machine learning classification algorithms. There are four non-class attributes: sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm). Each row corresponds to measurements from one iris plant.
Using the provided array, calculate the minimum, maximum, mean, and standard deviation for each attribute (you can assume the order of attributes above reflects the order of the columns in the array). Create a new array that excludes outliers (for this exercise, flowers with a value more than 2 standard deviations away from the mean for any of the attributes).

*Hint*: The axis can be specified for the `arr.any()` and `arr.all()` methods.
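
As a small illustration of the hint, the sketch below applies `any()` and `all()` along each axis of a made-up boolean array named `flags`. It is a generic example, not a solution to the exercise.

```python
import numpy as np

flags = np.array([[True, True],
                  [True, False],
                  [False, False]])

all_per_row = flags.all(axis=1)   # True only where every value in the row is True
any_per_row = flags.any(axis=1)   # True where at least one value in the row is True
all_per_col = flags.all(axis=0)   # same checks, but down each column
any_per_col = flags.any(axis=0)

print(all_per_row, any_per_row)
print(all_per_col, any_per_col)
```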

In [55]:
## CODE CELL 55

from sklearn import datasets

iris_data = datasets.load_iris()
iris_arr = iris_data.data

**Answer 1:**

In [56]:
## CODE CELL 56

iris_arr.shape    # how many rows are we working with?

(150, 4)

In [57]:
## CODE CELL 57

iris_arr.min(axis=0)    # minimum values

array([4.3, 2. , 1. , 0.1])

In [58]:
## CODE CELL 58

iris_arr.max(axis=0)    # maximum values

array([7.9, 4.4, 6.9, 2.5])

In [59]:
## CODE CELL 59

mean_vals = iris_arr.mean(axis=0)    # mean values
mean_vals

array([5.84333333, 3.054     , 3.75866667, 1.19866667])

In [60]:
## CODE CELL 60

std_vals = iris_arr.std(axis=0)    # standard deviations
std_vals

array([0.82530129, 0.43214658, 1.75852918, 0.76061262])

In [61]:
## CODE CELL 61
# To create a mask against outliers, first get boolean array differentiating outliers vs. non-outliers

notOutlier = (iris_arr <= mean_vals+2*std_vals)&(iris_arr >= mean_vals-2*std_vals)
notOutlier

array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True, False,  True,  True],
       [ True, False,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
 

In [62]:
## CODE CELL 62
# To use the mask, we need one boolean per row: collapse the 2D array to 1D, keeping a row only if all of its values are non-outliers

compressed = notOutlier.all(axis=1)
compressed

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False,  True,  True,  True, False,  True,  True,  True,
        True,  True,

In [63]:
## CODE CELL 63
# Finally, use the mask to exclude rows with outliers

iris_arr[compressed]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [4.9, 3.1, 1.5, 0.1],
       [5. , 3.2, 1.2, 0.2],
       [5.5, 3.5, 1.3, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [4.4, 3

In [64]:
## CODE CELL 64
# What is the shape of the outlier-excluded array?

iris_arr[compressed].shape

(139, 4)
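
The same filtering steps can be bundled into a small reusable function. The version below is one possible sketch rather than the only approach: it expresses the outlier test in terms of z-scores (the number of standard deviations each value lies from its column mean), which should be equivalent to the two-sided comparison used above apart from floating-point rounding at the boundary. The function name `drop_outlier_rows` and the `demo` array are made up for this example, and the function assumes no column has a standard deviation of zero.

```python
import numpy as np

def drop_outlier_rows(arr, n_std=2.0):
    """Return the rows of a 2D array whose values all lie within
    n_std standard deviations of their column mean.

    Assumes every column has a nonzero standard deviation.
    """
    z_scores = np.abs((arr - arr.mean(axis=0)) / arr.std(axis=0))
    keep = (z_scores <= n_std).all(axis=1)   # one boolean per row
    return arr[keep]

# Hypothetical data: the last row has an extreme value in the first column
demo = np.array([[1.0, 10.0],
                 [1.1, 10.2],
                 [0.9, 9.8],
                 [1.0, 10.1],
                 [1.2, 9.9],
                 [1.0, 10.0],
                 [0.8, 10.0],
                 [9.0, 10.0]])

cleaned = drop_outlier_rows(demo)
print(cleaned.shape)
```

With the Iris data loaded as above, the call would be `drop_outlier_rows(iris_arr)`.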

*Reference*:

The following material was consulted during development of this notebook, which loosely follows the structure of McKinney's chapter on NumPy Basics:

W. McKinney, "Chapter 4 - NumPy Basics: Arrays and Vectorized Computation," in *Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython.* Sebastopol, CA: O'Reilly Media, 2012. [Online] Available: EBSCOhost, https://search-ebscohost-com.proxy.libraries.rutgers.edu/login.aspx?direct=true&db=nlebk&AN=495822&site=eds-live.