# Test Your Software Installation

If this is your first time using a Jupyter notebook, please follow along with me in the class video.  If you know what you are doing, go ahead and run the cells and make sure everything works on your system.

In [1]:
import numpy as np
import pandas as pd
from sklearn import datasets
import matplotlib.pyplot as plt

%matplotlib inline

###  No errors yet? Your software is installed!
If you don't have any errors from running the cell above, then you are set.

# Numpy

NumPy stands for Numerical Python.  It gives us a LOT of very useful things:
  * linear algebra
  * uses special libraries for your CPU to do the linear algebra routines FAST
  * gives us ndarrays -- n-dimensional arrays.
  
Numpy is the basis of machine learning in python.  Nearly every tool we use either builds directly on numpy or is designed to work with its arrays: keras, tensorflow, pytorch, pymc3, pandas, scikit-learn, scikit-image, and most other machine learning libraries you will find in python.

So let's see a tiny bit of what it can do.

Examples taken from:
https://jakevdp.github.io/PythonDataScienceHandbook/

In [3]:
# Shape, ndim

np.random.seed(0)  # seed for reproducibility

x1 = np.random.randint(10, size=6)  # One-dimensional array
x2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array

In [4]:
x1

array([5, 0, 3, 3, 7, 9])

In [5]:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [6]:
x3

array([[[8, 1, 5, 9, 8],
        [9, 4, 3, 0, 3],
        [5, 0, 2, 3, 8],
        [1, 3, 3, 3, 7]],

       [[0, 1, 9, 9, 0],
        [4, 7, 3, 2, 7],
        [2, 0, 0, 4, 5],
        [5, 6, 8, 4, 1]],

       [[4, 9, 8, 1, 1],
        [7, 9, 9, 3, 6],
        [7, 2, 0, 3, 5],
        [9, 4, 4, 6, 4]]])

In [7]:
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)

x3 ndim:  3
x3 shape: (3, 4, 5)
x3 size:  60


In [8]:
print("dtype:", x3.dtype)


dtype: int64


In [9]:
a  = np.arange(1,10)
print(f"The shape is {a.shape}")
print(a)

The shape is (9,)
[1 2 3 4 5 6 7 8 9]


In [10]:
a.shape

(9,)

In [11]:
b = a.reshape(-1,1)

In [12]:
b.shape

(9, 1)

In [13]:
b

array([[1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])

In [14]:
# Reshape
grid = np.arange(1, 10,.1).reshape((-1, 10))
print(grid)

[[1.  1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9]
 [2.  2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9]
 [3.  3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9]
 [4.  4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9]
 [5.  5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9]
 [6.  6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9]
 [7.  7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9]
 [8.  8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9]
 [9.  9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9]]


In [15]:
grid.shape

(9, 10)
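The `-1` passed to `reshape` in the cells above asks numpy to infer that dimension from the total number of elements. Here is a minimal sketch of that behaviour; the array `demo` is a made-up example and not part of the original notebook.

```python
import numpy as np

demo = np.arange(12)  # hypothetical 12-element array for illustration

# -1 tells numpy to work out that dimension from the total size
as_column = demo.reshape(-1, 1)   # 12 rows, 1 column
as_grid = demo.reshape(3, -1)     # 3 rows, so numpy infers 4 columns

print(as_column.shape)
print(as_grid.shape)

# The sizes must divide evenly, otherwise numpy raises a ValueError
try:
    demo.reshape(5, -1)
except ValueError as err:
    print("reshape failed:", err)
```

Only one dimension can be `-1` at a time, since numpy needs the others to do the division.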

# Numpy uses vectorization

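Vectorization means applying an operation to a whole array at once, so the loop runs inside numpy's compiled code instead of in the Python interpreter.  For linear algebra, numpy hands the work to optimized Basic Linear Algebra Subprograms (BLAS) libraries.  Related ideas for speeding up matrix multiplication, such as [Strassen's algorithm](https://youtu.be/ORrM-aSNZUs), are very cool and worth a look.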


In [16]:
def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output


values = np.random.randint(1, 10, size=5)
print(values)
print(compute_reciprocals(values))


[5 4 5 5 9]
[0.2        0.25       0.2        0.2        0.11111111]


In [17]:
big_array = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(big_array)

1.34 s ± 23.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [18]:
values

array([5, 4, 5, 5, 9])

In [19]:
values * 10

array([50, 40, 50, 50, 90])

In [20]:
print(compute_reciprocals(values))
print(1.0 / values)

[0.2        0.25       0.2        0.2        0.11111111]
[0.2        0.25       0.2        0.2        0.11111111]


In [21]:
%timeit (1.0 / big_array)

1.22 ms ± 59.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
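The same pattern applies to longer calculations: a loop over elements can often be replaced by a single array expression. The sketch below standardizes some made-up data both ways and checks that the results agree. The names `samples` and `standardize_loop` are hypothetical and chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)            # seeded generator, chosen for this example
samples = rng.integers(1, 100, size=10)   # hypothetical data

def standardize_loop(values):
    """Subtract the mean and divide by the std, one element at a time."""
    mean = values.mean()
    std = values.std()
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = (values[i] - mean) / std
    return output

# The vectorized version is a single array expression
standardized = (samples - samples.mean()) / samples.std()

print(np.allclose(standardize_loop(samples), standardized))
```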


# Broadcasting

In [22]:
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
a + b

array([5, 6, 7])

In [23]:
a + 5


array([5, 6, 7])

In [24]:
M = np.ones((3, 3))
M

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [25]:
M + a

array([[1., 2., 3.],
       [1., 2., 3.],
       [1., 2., 3.]])

In [26]:
a = np.arange(3)
b = np.arange(3)[:, np.newaxis]

print(a)
print(b)

[0 1 2]
[[0]
 [1]
 [2]]


In [27]:
a + b

array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])
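The results above follow from numpy's broadcasting rule: shapes are compared starting from the last dimension, and two dimensions are compatible when they are equal or when one of them is 1. The sketch below checks this for the shapes used in the cells above. Note that `np.broadcast_shapes` needs NumPy 1.20 or newer.

```python
import numpy as np

row = np.arange(3)                   # shape (3,)
col = np.arange(3)[:, np.newaxis]    # shape (3, 1)

# (3,) and (3, 1) are both stretched to a common (3, 3) shape
print(np.broadcast_shapes(row.shape, col.shape))
print((row + col).shape)

# Incompatible shapes raise a ValueError instead of guessing
try:
    np.ones((3, 2)) + row
except ValueError as err:
    print("cannot broadcast:", err)
```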

# Let's plot a sine wave with numpy and matplotlib

You can adjust the range of the wave by playing around with the `np.arange()` parameters.  The syntax is `(start, stop, step)`, just like python's built-in `range()`, except that the step can be a decimal amount.

In [None]:
x = np.arange(0,20,.1)
a = np.sin(x)
plt.plot(x, a);  # the trailing semi-colon ; suppresses the object output from jupyter -- go ahead and try removing it, you will see!
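One caveat with decimal steps: because of floating-point rounding, the exact number of elements `np.arange` returns can be hard to predict. `np.linspace`, which takes the number of points instead of a step, is a common alternative. This small sketch puts the two side by side; the variable names are made up for the example.

```python
import numpy as np

stepped = np.arange(0, 20, 0.1)                    # start, stop (excluded), step
spaced = np.linspace(0, 20, 200, endpoint=False)   # start, stop, number of points

print(len(stepped), len(spaced))
print(stepped[:3], spaced[:3])
```

With `np.linspace` you always know how many points you get, which is handy when `x` and `y` must have matching lengths for plotting.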

# We can dress up our plot a bit with extra attributes

In [None]:
plt.plot(x, a)
plt.title("This is a Sine Wave")
plt.xlabel("These are the values of X")
plt.ylabel("Y LABEL!!");

If we want to control the size of the plot, one way is to set it _before_ we make the plot.

In [None]:
# figsize allows us to control the size of the plot, and we access it through the plot's figure object

plt.figure(figsize=(10,5))
plt.plot(x, a)
plt.title("This is a Sine Wave")
plt.xlabel("These are the values of X")
plt.ylabel("Y LABEL!!");
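Another way to set the size, sketched here as an alternative and not as the approach this notebook uses, is matplotlib's object-oriented interface. `plt.subplots` returns the figure and its axes together, and the figure object can also be resized afterwards.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 20, 0.1)

# plt.subplots creates the figure and its axes together, with the size set up front
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x, np.sin(x))
ax.set_title("This is a Sine Wave")
ax.set_xlabel("These are the values of X")
ax.set_ylabel("Y LABEL!!")

# The size can also be changed after the fact on the figure object
fig.set_size_inches(8, 4)
```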

## Finally let's plot a few other things on the same plot.


In [None]:
plt.figure(figsize=(10,5))
b = np.cos(np.arange(0,20,.1))
plt.plot(x, a, label = "sine", marker = 'o' ) # add a label so we can create a legend, and set the marker style
plt.plot(x, b, label = 'cos', linewidth = 4, color = 'purple') # add b by simply plotting it as well, and make it thicker and purple
plt.title("This is a Sine AND a Cosine Wave")
plt.xlabel("These are the values of X")
plt.ylabel("Y LABEL!!")
plt.legend();

# Ok, let's load up some data with Scikit-Learn and Pandas
We will use some built-in datasets from Scikit-learn. Later on we will learn to load our own data.

In [None]:
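# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version to run.
boston = datasets.load_boston()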

In [None]:
boston.feature_names

# Push our data into a pandas DataFrame for ease of use

In [None]:
housing = pd.DataFrame(boston.data, columns = boston.feature_names)

In [None]:
# look at the shape of your data, presented as (rows, columns)
housing.shape

In [None]:
# bottom 5 rows of the dataset
housing.tail()

In [None]:
# show the first 5 rows of data by default
housing.head()

In [None]:
housing['CRIM'].mean()

In [None]:
housing.mean(axis = 0)

In [None]:
housing.skew()

In [None]:
# some basic stats on our numerical data
housing.describe()

# Some Indexing

Pandas has slicing built in, the same way python lists work (and numpy arrays).  This is done using the `[]` notation.  Additionally Pandas has some extra tricks with two main types of indexing: `.iloc`, which is primarily integer-position based, and `.loc`, which is primarily label based.
You can read more [at the official docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)

In [None]:
housing[:3]  # gives first 3 rows of the dataset

In [None]:
housing[3:11:2]  # rows 3 up to (but not including) 11, stepping by 2; note the start is inclusive and the end is excluded (like python)

### `.iloc`
Ok, that works for rows, but as soon as we want columns in pandas we need to switch to `.iloc`

In [None]:
housing.iloc[:3,:2] # the first 3 rows and 2 columns -- note the comma ',' which is used to tell pandas that we are indexing both rows and columns

In [None]:
# note we can do the same things we did before with `.iloc` as well
housing.iloc[:2]

### `.loc`

Here we will use label-based indexing.  We can use it to select columns by their string name.  
It's important to note that we will index with `.loc[:4, [strings]]` and that 4 is a label here: it's the index label, which happens to be an integer (it's an integer most of the time).  Unlike ordinary python slicing, a `.loc` slice includes its end label, so `:4` returns the rows labelled 0 through 4.


In [None]:
housing.loc[:4, ['ZN', 'TAX']]
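The difference between positions and labels is easiest to see when the index labels are not simply 0, 1, 2, and so on. The sketch below uses a small made-up DataFrame (`df`, with hypothetical values and index labels) to compare `.iloc` and `.loc`.

```python
import pandas as pd

# Hypothetical frame whose index labels are NOT 0, 1, 2, ...
df = pd.DataFrame(
    {"ZN": [18.0, 0.0, 0.0, 12.5], "TAX": [296, 242, 222, 311]},
    index=[10, 20, 30, 40],
)

by_position = df.iloc[0:2]     # positions 0 and 1, end position excluded
by_label = df.loc[10:30]       # labels 10, 20 and 30, end label included

print(by_position)
print(by_label)
print(df.loc[20, "TAX"])       # row label 20, column label "TAX"
print(df.iloc[1, 1])           # second row, second column: the same cell
```

Here `.iloc[0:2]` returns two rows while `.loc[10:30]` returns three, because label slices keep the end point.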

### Boolean indexing

A common thing we may want to do is look for certain rows (or columns) that hold a certain value.  Pandas makes this easy with boolean indexing.
This is best thought of as a two-step process:
  1. create a boolean _mask_
  2. use your mask to _index_ your dataframe.

Let's answer the question "Show me the rows where the age of the home is greater than 65"

In [None]:
housing.head()

In [None]:
# step one, create the mask.
mask = housing['AGE']>65
mask

In [None]:
# Now I can use my boolean mask to index on the dataframe.  This is obviously a bit more complicated under the hood -- but it works fabulously.
housing[mask]

In [None]:
# You will commonly see this pattern show up like this
housing[housing['AGE']>65]
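Masks can also be combined to ask more than one question at a time. The sketch below uses a small stand-in DataFrame (`homes`, with made-up values) so that it runs on its own; the same operators work on any DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the housing data, with made-up values
homes = pd.DataFrame({"AGE": [65.2, 78.9, 45.8, 96.1], "TAX": [296, 242, 222, 311]})

old = homes["AGE"] > 65
low_tax = homes["TAX"] < 300

# & is "and", | is "or", ~ is "not"; inline comparisons need parentheses
print(homes[old & low_tax])
print(homes[(homes["AGE"] > 65) | (homes["TAX"] < 230)])
print(homes[~old])
```

Python's `and`, `or` and `not` keywords do not work element by element on a Series, which is why the symbol operators are used here.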

# A pandas Series is like a DataFrame, but for a single column

All the same rules apply.  A few neat built-in methods are available only on Series, but the two are mostly the same.

In [None]:
housing_targets = pd.Series(boston.target)

In [None]:
housing_targets
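As a quick illustration of "mostly the same", the sketch below runs a few operations on a small made-up Series (`prices`, hypothetical values). Indexing works as it does on a DataFrame, while `unique()` is an example of a method that exists on a Series but not on a DataFrame.

```python
import pandas as pd

# Hypothetical target values, made up for this example
prices = pd.Series([24.0, 21.6, 34.7, 21.6], name="price")

print(prices[prices > 22])   # boolean indexing works as it does on a DataFrame
print(prices.iloc[:2])       # so does position-based indexing
print(prices.unique())       # distinct values, a method DataFrames do not have
print(prices.to_frame())     # convert the Series into a one-column DataFrame
```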

# Finally, let's load some data we can "look" at

In [None]:
digits = datasets.load_digits()

In [None]:
digits.data[0].shape  # shape tells us the dimensions of our data, it's a 1D vector with 64 elements.

In [None]:
digits.data[0].reshape(8,8)

In [None]:
plt.imshow(digits.data[0].reshape(8,8), cmap='gray')  #we reshaped it into an 8,8 in order to plot it.

In [None]:
digits.data.shape  #we have 1797 samples, one per row, and each row is a row vector of length 64.
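The link between the flat rows of length 64 and the 8x8 pictures is just a reshape. The sketch below shows the round trip on a made-up batch of tiny images (`images`, filled with placeholder numbers, not the real digits data).

```python
import numpy as np

# Hypothetical batch of 5 tiny 8x8 "images" filled with made-up pixel values
images = np.arange(5 * 8 * 8).reshape(5, 8, 8)

# Flatten each image into a row vector of length 64, one row per sample
flat = images.reshape(len(images), -1)
print(flat.shape)

# Reshaping a row back to (8, 8) recovers the original image
restored = flat[0].reshape(8, 8)
print(np.array_equal(restored, images[0]))
```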

In [None]:
fig, axes = plt.subplots(2, 10, figsize=(12, 2))

for i, ax in enumerate(axes.flat):
    ax.imshow(digits.data[i].reshape(8,8), cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')
    ax.set_xticks([])
    ax.set_yticks([])