#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Visualizations

[NumPy](http://www.numpy.org/) and [Matplotlib](https://matplotlib.org/) are two packages that you'll use regularly throughout your machine learning code. These packages are tightly integrated with [Pandas](https://pandas.pydata.org/) and other machine learning libraries.

NumPy is a package for scientific computing with Python. It makes working with multidimensional arrays easy and contains many useful linear algebra algorithms.

Matplotlib is a library for creating two-dimensional charts. It is integrated with Pandas and can display charts in-line in Jupyter notebooks and Colab.

## Overview

### Learning Objectives

* TODO(joshmcadams)

### Prerequisites

* TODO(joshmcadams)

### Estimated Duration

60 minutes

## NumPy

Python has a built-in list data type. Lists are simply sequences of other pieces of data. The elements contained in a list can be of any data type that Python supports, including other lists.

Lists are "container" data types. This means that they can contain other types of data. Dictionaries and tuples are also containers.

A container can contain another container data type. This is referred to as "nesting". Nesting of data structures can be arbitrarily deep.

In [0]:
my_list = [         # outermost list begins
  ['a', 'b', 'c'],  # sub-list
  ['x',             # another sub-list
    ['y', 'z'],     # a sub-list two levels deep
  ],
]

print(my_list)
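
Elements of a nested structure are reached by chaining index operations, one per level of nesting. The sketch below rebuilds the same list; the names `first_sub_list` and `deep_item` are chosen here only for illustration.

```python
my_list = [
  ['a', 'b', 'c'],
  ['x', ['y', 'z']],
]

# A single index selects one of the sub-lists.
first_sub_list = my_list[0]

# Chained indexes walk down one level of nesting at a time.
deep_item = my_list[1][1][0]

print(first_sub_list)
print(deep_item)
```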

Python lists are powerful data structures, but for scientific computing and machine learning, NumPy provides an even more powerful structure: `numpy.array`.

A `numpy.array` is an n-dimensional array. n can be 1 (similar to a flat list) or more.

`numpy.array` also comes with utilities for performing mathematical operations on every element in the array, for linear algebra, and for manipulating the dimensions of the array.

There are many ways to create a `numpy.array`, but one of the most straightforward is to use a Python list:

In [0]:
import numpy as np

np.array([4, 5, 6])

You can even pass in nested lists:

In [0]:
import numpy as np

np.array(
  [
    [9, 8, 7],
    [6, 5, 4],
    [3, 2, 1],
  ]
)

You can see the dimensions of the array by asking for its shape.

In [0]:
import numpy as np

my_array = np.array(
  [
    [9, 8, 7, 6, 5],
    [4, 3, 2, 1, 0],
  ]
)

print(my_array.shape)
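
Alongside `shape`, an array also exposes the number of dimensions and the total element count. A small sketch using the same array, with `ndim` and `size` as the only additions:

```python
import numpy as np

my_array = np.array(
  [
    [9, 8, 7, 6, 5],
    [4, 3, 2, 1, 0],
  ]
)

print(my_array.shape)  # length of each dimension: (rows, columns)
print(my_array.ndim)   # number of dimensions
print(my_array.size)   # total number of elements
```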

If you have a one-dimensional list and would like to convert it to a multidimensional `numpy.array`, you can just tell NumPy to reshape the array.

In [0]:
import numpy as np

# Create a one-dimensional array with 21 elements
my_array = np.array(range(21))
print("Starting shape: {}".format(my_array.shape))
print(my_array)

# Reshape the array into a two-dimensional array
my_array = my_array.reshape(3, 7)
print("New shape: {}".format(my_array.shape))
print(my_array)
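
Reshaping only works when the new shape holds exactly the same number of elements. The sketch below shows two related behaviors: passing `-1` lets NumPy infer one dimension, and an incompatible shape raises an error. The names `inferred` and `reshape_failed` are illustrative.

```python
import numpy as np

my_array = np.array(range(21))

# -1 asks NumPy to infer that dimension from the total element count.
inferred = my_array.reshape(3, -1)
print(inferred.shape)

# 21 elements cannot fill a 4 x 5 array, so this raises a ValueError.
reshape_failed = False
try:
  my_array.reshape(4, 5)
except ValueError as error:
  reshape_failed = True
  print("Reshape failed: {}".format(error))
```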

There are convenience functions for initializing an n-dimensional array with the same value in every element. In the example below we create a three-dimensional array with every element set to zero. There are also functions for initializing an array with all [ones](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ones.html#numpy.ones), with uninitialized ([empty](https://docs.scipy.org/doc/numpy/reference/generated/numpy.empty.html#numpy.empty)) values, from a [file](https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html#numpy.fromfile), and more. Check out the [NumPy documentation](https://docs.scipy.org/doc/) for the full list of ways to create and initialize a `numpy.array`.

In [0]:
import numpy as np

my_array = np.zeros((3,4,5))

print(my_array)

You might have noticed in the call to `zeros` above that we passed in a tuple with the dimensions of the array that we wanted. `zeros` and many other NumPy functions that accept dimensions typically expect those dimensions to be provided as a list or tuple.

For example, this is okay:

```
  np.zeros((2, 3))
```

As is this:

```
  np.zeros([2, 3])
```

But this is not:

```
  np.zeros(2, 3)
```

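This is because these functions expect their first parameter to be the full set of dimensions. The second positional parameter of `zeros` is the data type, so `np.zeros(2, 3)` tries to interpret `3` as a data type and raises a `TypeError`. Packaging the dimensions in a list or tuple allows NumPy to know exactly which values are dimensions.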
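
To see the difference in practice, the sketch below calls `zeros` both ways and catches the resulting exception. The exact error message varies between NumPy versions, so it is printed rather than assumed; `raised_type_error` is an illustrative name.

```python
import numpy as np

# Dimensions packaged in a tuple: this works.
print(np.zeros((2, 3)).shape)

# Dimensions passed as separate arguments: the 3 lands in the dtype
# parameter, which is not a valid data type.
raised_type_error = False
try:
  np.zeros(2, 3)
except TypeError as error:
  raised_type_error = True
  print("TypeError: {}".format(error))
```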

You might have also noticed that our zeros were all floating point numbers. What if you wanted integers or some other data type? For that you can add a named `dtype` parameter.

In [0]:
import numpy as np

my_array = np.zeros((2,2,2,2), dtype='int')

print(my_array)
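
You can confirm which data type an array holds by inspecting its `dtype` attribute. In this sketch the exact integer width printed for `'int'` depends on the platform, so no specific value is assumed; the variable names are illustrative.

```python
import numpy as np

float_zeros = np.zeros((2, 2))
int_zeros = np.zeros((2, 2), dtype='int')

# The default is a 64-bit float; 'int' maps to the platform's default integer.
print(float_zeros.dtype)
print(int_zeros.dtype)
```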

It can also be useful to create arrays of random numbers. There are numerous functions for doing this, each with different characteristics for the numbers they generate.

[numpy.random.random](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.random.html) generates uniformly distributed floating point numbers in the half-open interval [0.0, 1.0), while [numpy.random.standard_normal](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.standard_normal.html#numpy.random.standard_normal) generates numbers from a standard normal distribution (mean 0, standard deviation 1). Check out [the documentation](https://docs.scipy.org/doc/numpy/reference/routines.random.html) for other functions for generating random numbers.

In [0]:
import numpy as np

ran = np.random.random((2,3))
std = np.random.standard_normal((2,3))

print(ran)
print(std)
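
Random arrays differ on every run, which can make results hard to reproduce. One possible approach, assuming NumPy 1.17 or newer where `numpy.random.default_rng` is available, is to create a seeded generator. The seed value `42` and the variable names are arbitrary choices for this sketch.

```python
import numpy as np

# Two generators created with the same seed produce the same numbers.
first = np.random.default_rng(42).random((2, 3))
second = np.random.default_rng(42).random((2, 3))

print(first)
print(np.array_equal(first, second))

# The same generator object also offers standard normal samples.
rng = np.random.default_rng(42)
normal = rng.standard_normal((2, 3))
print(normal.shape)
```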

Now that we know how to create a `numpy.array` we can move on and start using the array.

One of the simplest ways to use a NumPy array is to perform some mathematical operation on every element in the array. In the example below we create a two-dimensional array of zeros and then add one to every element in the array.

Think about how you would do this with a list of lists in Python. You'd need nested for loops to iterate through each element of the data structure. With NumPy it is as simple as a `+`.

Other operators such as `*`, `-`, `/`, and `**` also work elementwise on each item in the array.

In [0]:
import numpy as np

my_array = np.zeros([4,4])

my_array = my_array + 1

print(my_array)
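
For comparison, here is a sketch of the nested-loop version described above next to the NumPy version. The list-of-lists variable `nested` is introduced only for this illustration.

```python
import numpy as np

# Plain Python: a list of lists needs one loop per dimension.
nested = [[0.0] * 4 for _ in range(4)]
for row in range(len(nested)):
  for col in range(len(nested[row])):
    nested[row][col] = nested[row][col] + 1

# NumPy: the same operation is a single expression.
my_array = np.zeros([4, 4]) + 1

# Both approaches produce the same values.
print(np.array_equal(np.array(nested), my_array))
```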

NumPy also allows you to perform mathematical operations between multiple arrays. The `+`, `-`, `/`, and `*` operators perform element-wise operations on like-sized arrays.

In [0]:
import numpy as np

array_one = np.array(range(1, 7))
print(array_one, '\n')

array_one = array_one.reshape(3,2)
print(array_one, '\n')

array_two = np.array(range(1, 7)).reshape(3,2)

print(array_one / array_two, '\n')
print(array_one + array_two)
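
When the shapes are not compatible, the element-wise operators fail rather than guess. The sketch below adds a 3 x 2 array to a 2 x 3 array and catches the resulting `ValueError`; `raised_value_error` is an illustrative name.

```python
import numpy as np

array_one = np.array(range(1, 7)).reshape(3, 2)
array_two = np.array(range(1, 7)).reshape(2, 3)

# The arrays hold the same number of elements, but their shapes differ,
# so there is no element-by-element pairing.
raised_value_error = False
try:
  array_one + array_two
except ValueError as error:
  raised_value_error = True
  print("ValueError: {}".format(error))
```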


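There are even linear algebra functions. One of the most common is the dot product. For two-dimensional arrays this is the matrix product, which requires the number of columns in the first array to match the number of rows in the second. It can be performed with the `@` operator: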

In [0]:
import numpy as np

array_one = np.array(range(1, 7)).reshape(3,2)
array_two = np.array(range(1, 7)).reshape(2,3)

array_one @ array_two

Though you might see the `@` operator in code, it can be clearer to use the `dot` method to get the dot product.

In [0]:
import numpy as np

array_one = np.array(range(1, 7)).reshape(3,2)
array_two = np.array(range(1, 7)).reshape(2,3)

array_one.dot(array_two)
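
For two-dimensional arrays the two spellings give the same result. A quick check, with `operator_result` and `method_result` as illustrative names:

```python
import numpy as np

array_one = np.array(range(1, 7)).reshape(3, 2)
array_two = np.array(range(1, 7)).reshape(2, 3)

operator_result = array_one @ array_two
method_result = array_one.dot(array_two)

# A (3, 2) array times a (2, 3) array yields a (3, 3) array.
print(operator_result.shape)
print(np.array_equal(operator_result, method_result))
```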

There is also a `transpose` function. You might also see it in code as the `T` attribute (for example, `array.T`).

Also note the use of the `arange` function in the example below instead of `array(range(...))`. `arange` is a top-level NumPy function that builds the array directly, and it is called in a similar manner to Python's built-in `range` function.

In [0]:
import numpy as np

array = np.arange(12).reshape(4,3)
print(array, '\n')

array = array.transpose()
print(array)
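
The sketch below shows the `range`-style arguments that `arange` accepts and confirms that `T` and `transpose()` agree. The specific start, stop, and step values are arbitrary.

```python
import numpy as np

print(np.arange(5))         # stop only
print(np.arange(2, 10, 2))  # start, stop (exclusive), step

array = np.arange(12).reshape(4, 3)

# The T attribute is shorthand for calling transpose().
print(array.T.shape)
print(np.array_equal(array.T, array.transpose()))
```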

## Matplotlib

Matplotlib is a great way to visualize datasets. In this Colab we will generate some artificial datasets with NumPy and then use Matplotlib to visualize them.

We will start with a dataset that has a linear pattern to its data. We will randomly create the `x` data and then use a linear equation to derive the `y` data.

In [0]:
import numpy as np

# The size of the dataset that we'll be using to perform our linear regression.
DATA_SET_SIZE = 1000

# The maximum value of the x coordinate. The values of x will fall in the range
# [0, X_MAX).
X_MAX = 5

# The Y-intercept is one of the "secret" values that we'll be trying to predict
# via linear regression.
INTERCEPT = 4

# The slope is another value that we'll be trying to predict using linear
# regression.
SLOPE = 3

# Generate the x-coordinates for our dataset 
x = X_MAX * np.random.rand(DATA_SET_SIZE, 1)

# Generate the y-coordinates for our dataset using the linear equation
# y = mx + b.
y = SLOPE * x + INTERCEPT

We can then import Matplotlib and use it to plot the data.

We import the `pyplot` subpackage of `matplotlib` and then use it to plot the `x` and `y` data. The `'b.'` tells `pyplot` to use blue dots to represent each data point.

In [0]:
import matplotlib
import matplotlib.pyplot as plt

plt.plot(x, y, 'b.')
plt.show()

The data does indeed have an x range from zero to our max x value. Notice that the y-intercept and slope match our seeded values.

This data looks nothing like we'd see in the real world though. Let's add a little randomness to the data to make it more realistic.

In [0]:
y = y + 2 * np.random.randn(DATA_SET_SIZE, 1)

plt.plot(x, y, 'b.')
plt.show()

We will now draw a red line representing the actual linear equation used to build this data. Notice that we use `'r-'` to tell `pyplot` that we want a red line.

In [0]:
plt.plot(x, y, 'b.')
(low_x, high_x) = (0, X_MAX)
plt.plot([low_x, high_x], [SLOPE * low_x + INTERCEPT, SLOPE * high_x + INTERCEPT], 'r-')
plt.show()
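
The comments in the data-generation cell describe the slope and intercept as values to be recovered from the data. One possible way to check this numerically is a least-squares fit with `np.polyfit`, which is not used elsewhere in this Colab. The sketch regenerates a comparable dataset with a seeded generator (an assumption made so the example is self-contained) rather than reusing the `x` and `y` above.

```python
import numpy as np

# Rebuild a noisy linear dataset like the one above: y = 3x + 4 plus noise.
rng = np.random.default_rng(0)
x = 5 * rng.random((1000, 1))
y = 3 * x + 4 + 2 * rng.standard_normal((1000, 1))

# polyfit expects one-dimensional inputs, so flatten the column vectors.
# For degree 1 it returns the coefficients as [slope, intercept].
slope, intercept = np.polyfit(x.ravel(), y.ravel(), 1)

print(slope, intercept)
```

The fitted values should land close to, but not exactly on, the seeded slope of 3 and intercept of 4 because of the added noise.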

Matplotlib is not limited to line graphs and scatterplots. The library supports many different chart types. Check out the [gallery](https://matplotlib.org/gallery.html) to see many of them in action.

Below is an example of a simple bar chart. The code doesn't look much different from the code for the scatterplot and line chart above.

In [0]:
import numpy as np
import matplotlib.pyplot as plt

bar_count = 5

x = np.arange(bar_count)
y = np.random.rand(bar_count)

plt.bar(x, y, color='green')
plt.show()

You can access [subplots](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplot.html) of the current figure. Any figure can have one or more subplots. In the example below we get access to the one and only subplot of the figure and create a horizontal bar chart with labels and a title.

In [0]:
import numpy as np
import matplotlib.pyplot as plt

_, ax = plt.subplots()

bar_count = 4
y = np.arange(bar_count)
x = np.random.rand(bar_count)
x_error = np.random.rand(bar_count) / 10

ax.barh(y, x, xerr=x_error, align='center', color='blue', ecolor='red')
ax.set_yticks(y)
ax.set_yticklabels(['A', 'B', 'C', 'D'])
ax.set_xlabel('Some Measure')
ax.set_title('This is a title')

plt.show()
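
A figure with more than one subplot is created the same way, by passing row and column counts to `plt.subplots`. The sketch below draws the same random values as a vertical and a horizontal bar chart side by side; `values`, `positions`, and the titles are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# One row and two columns yields an array holding two axes objects.
fig, axes = plt.subplots(nrows=1, ncols=2)

bar_count = 4
positions = np.arange(bar_count)
values = np.random.rand(bar_count)

axes[0].bar(positions, values, color='green')
axes[0].set_title('Vertical')

axes[1].barh(positions, values, color='blue')
axes[1].set_title('Horizontal')

plt.show()
```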

## Seaborn

Matplotlib is just one of the many visualization libraries to choose from. Let's take a closer look at another one, [Seaborn](https://seaborn.pydata.org/introduction.html).

Seaborn is a library for making statistical graphics in Python. It is built on top of Matplotlib and closely integrated with Pandas data structures.

We will use the `tips` dataset, which is one of the example datasets from the [Seaborn data repository](https://github.com/mwaskom/seaborn-data).

In [0]:
# Import Seaborn Python library
import seaborn as sns

# Load tips dataset
tips = sns.load_dataset("tips")

# Add a new column for percentage of tip from the total bill amount
tips['percent_tip'] = 100 * tips["tip"] / tips["total_bill"]

# Print out the first 10 rows
tips.head(10)

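Let's plot a histogram of the `percent_tip` values. The [seaborn.distplot()](https://seaborn.pydata.org/generated/seaborn.distplot.html) function automatically calculates a good default bin size to produce a histogram. Note that `distplot` is deprecated in newer Seaborn releases in favor of `histplot` and `displot`, so the cell below may emit a warning or fail depending on the installed version.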

In [0]:
# Set the grid background color to white
sns.set(style="whitegrid")

# Plot histogram of the percent_tip values
sns.distplot(tips["percent_tip"], vertical=True)

A histogram combines values into bins (or buckets), producing frequency data for each bin. What if you'd like to see each of the values plotted separately? This only makes sense when the dataset is reasonably small; otherwise the resulting graph might be overwhelming and not useful.

This can be accomplished with the [seaborn.swarmplot()](https://seaborn.pydata.org/generated/seaborn.swarmplot.html) function.

In [0]:
ax = sns.swarmplot(y=tips["percent_tip"])

Looking at the plot above, you can get a sense of the distribution of values and still see where individual values land. However, `seaborn.swarmplot()` is especially useful when you're trying to segment the data along categories.

The chart below splits the data by time of dining: Lunch or Dinner.

In [0]:
ax = sns.swarmplot(x="time", y="percent_tip", data=tips)

Here's another example of the data being split across days of the week, rather than time of day.

In [0]:
ax = sns.swarmplot(x="day", y="percent_tip", data=tips)

Lastly, here's a more complex example where the previous chart is augmented with a second data categorization - in this case the gender of the person.

In [0]:
ax = sns.swarmplot(x="day", y="percent_tip", hue="sex", palette=["r", "c"], data=tips)

# Exercises

## Exercise 1

1. Create a 5 x 6 x 2 array filled with random floating point numbers drawn from a standard normal distribution.
2. Multiply every data point in the array by 10.

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
import numpy as np

# 1
normal_dist = np.random.randn(5,6,2)

# 2
normal_dist = normal_dist * 10

normal_dist

**Validation**

In [0]:
# TODO

## Exercise 2

1. Convert `list_one` and `list_two` into NumPy arrays.
2. Transpose the first matrix.
3. Find the dot product of the transposed first matrix and the second matrix.

### Student Solution

In [0]:
list_one = [
  [5, 6, 1, 5],
  [1, 2, 9, 0],
  [3, 2, 6, 1], 
]

list_two = [
  [4, 3, 1, 8],
  [3, 9, 0, 5],
  [2, 8, 4, 0],
]

# Your code goes here

### Answer Key

**Solution**

In [0]:
import numpy as np

# 1
list_one = np.array(list_one)
print(list_one, '\n')
list_two = np.array(list_two)
print(list_two, '\n')

# 2
list_one = np.transpose(list_one)
print(list_one, '\n')

# 3
print(list_one.dot(list_two))

**Validation**

In [0]:
# TODO

## Exercise 3

1. Generate a dataset of 1000 x, y data points based on a polynomial function.
1. Add some randomness to the dataset.
1. Use matplotlib to plot the points as green dots in a scatterplot.
1. Draw the polynomial line as a black line on the chart.

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
import numpy as np
import matplotlib.pyplot as plt

# 1. Generate a dataset of 1000 x, y data points based on a polynomial function.
DATA_SET_SIZE = 1000
X_MAX = 5

x = X_MAX * np.random.rand(DATA_SET_SIZE, 1)
f = lambda x: x**3
y = f(x)

# 2. Add some randomness to the dataset.
y = y + 20 * np.random.randn(DATA_SET_SIZE, 1)

# 3. Use matplotlib to plot the points as green dots in a scatterplot.
plt.plot(x, y, 'g.')

# 4. Draw the polynomial line as a black line on the chart.
x_range = np.linspace(0, X_MAX, 1000)
plt.plot(x_range, f(x_range), 'k-')
plt.show()


**Validation**

In [0]:
# TODO

## Exercise 4

Create a box-and-whisker plot of the polynomial dataset created in exercise three. Have one plot for x and one for y. These plots show the domain and range of the two axes of our dataset.

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(ncols=2)
ax1.set_title('X')
ax1.boxplot(x)

ax2.set_title('Y')
ax2.boxplot(y)

plt.show()


**Validation**

In [0]:
# TODO

## Exercise 5: Challenge (Ungraded)

1. Visit data.gov's CSV datasets and find "Demographic Statistics By Zip Code: City of New York — Demographic statistics broken down by zip code".
1. Load that data into your Colab using Pandas.
1. Create a heatmap of one of the columns overlaid on a map of the city of New York.

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
# TODO

**Validation**

In [0]:
# TODO