# Dummy

## Lab: Introduction to the python data science stack

In this lab, we will introduce some simple `python` commands and
introduce the `numpy` library.  The best way to learn a new language
is to try out the commands. As a general-purpose
programming language, `python` was not created
specifically with data in mind, so much of the data-specific
functionality comes from other packages, notably `numpy`
and `pandas`. To get
started, download and install a `python3` version from
[anaconda.com](http://anaconda.com).  For more resources about the
language in general, readers
may want to consult the [docs.python.org/tutorial](https://docs.python.org/3/tutorial/)
as well as the numpy tutorial
[docs.scipy.org/doc/numpy/user/quickstart.html](https://docs.scipy.org/doc/numpy/user/quickstart.html).

There are several interfaces to `python`. We will use
one of the most popular,
[Jupyter](http://jupyter.org), which runs python
code through a file called a notebook.  For console work,
the [ipython](http://ipython.org) console is quite an improvement
over the standard console and can be installed with
`conda`. Displays of code below appear as if run in
`Jupyter` rather than the default `python` console.

### Basic Commands

Like most languages, `python` uses *functions* 
to perform operations.   To run a
function called `funcname`, we type
`funcname(input1,input2)`, where the inputs (or *arguments*)
`input1` and `input2` tell
`python` how to run the function.  A function can have any number of
inputs. For example, to display a text representation
of something, we use the function
`print()`, which displays all of its arguments
(represented as strings) to the console. A string is a
non-numerical datatype: when a computer reads a file it is
reading in strings of characters. In `python` every object
has at least one representation as a string.



In [None]:
%%R
print('fit a model with', 11, 'variables')

In statistical learning, we will often work with vectors and matrices
of numbers. The builtin `python` functions do not support vectors and matrices;
these are introduced in `numpy`. However,
`python` does have sequences built in.
The following command instructs `python` to join together
the numbers 3, 4, and 5, and to save them as a
list named `x`. When we
type `x`, it gives us back the list.



In [None]:
%%R
x = [3, 4, 5]
x

Note that no function is called in creating `x` above. The
`[]` brackets denote the construction of a list in `python`.
Typing `funcname?` will cause `ipython` to display some
documentation for `funcname` when it exists.

We will often want to add two sets of numbers together. It is reasonable to try the following code,
though it will not produce the desired result.



In [None]:
%%R
y = [4, 9, 7]
x + y

The result may appear slightly counterintuitive – why did `python` not add the entries of the list
element-by-element? This brings us to the `numpy` library. In `python`, lists are basic `types`
that can hold *arbitrary* objects and addition of lists is defined as *concatenation*. Different
types can define addition differently. For example, addition of strings is concatenation,
while addition of integers results (sensibly) in an integer.



In [None]:
%%R
"hello" + " " + "world"

In [None]:
%%R
3 + 5
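
To make the distinction concrete, the following sketch contrasts list concatenation with an element-by-element sum written in plain `python`. The names `a`, `b`, `concatenated` and `elementwise` are ours, chosen only for illustration.

```python
a = [3, 4, 5]
b = [4, 9, 7]

# `+` on two lists joins them end to end
concatenated = a + b

# an element-by-element sum needs an explicit loop over pairs
elementwise = [ai + bi for ai, bi in zip(a, b)]

print(concatenated)
print(elementwise)
```

The explicit loop works, but it becomes cumbersome for larger computations.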

To add vectors of numbers, we must first tell `python` that the
objects are indeed vectors.
This is done by creating an `array` using
`numpy`.



In [None]:
%%R
x = np.array([3, 4, 5])

Note that `python` is unable to find `np`. This is because
we have tried to access the function `array` in the library
`np` (common shorthand for `numpy`). However,
we have not imported the
`numpy` library. In order to
access code in a package, it must be imported.



In [None]:
%%R
import numpy as np # standard name for the numpy library
x = np.array([3, 4, 5])
y = np.array([4, 9, 7])
x + y

The arrays `x` and `y` are one dimensional.
Arrays can be multidimensional, including more than 2 dimensions.
Matrices are usually represented as 2-dimensional arrays in `numpy`
(though there are *matrices* in `numpy`, most operations can be done
quite naturally on *arrays* instead).
In the labs of this book, we will use the
convention that a *matrix* is a two-dimensional `numpy` array while a
*vector* is a one-dimensional `numpy` array. Using arrays instead of matrices means that there
is no distinction between vectors and matrices
besides the number of dimensions: there is no need to distinguish
between row and column vectors.
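
As a small sketch of this convention, the example below builds a matrix and a vector and multiplies them. The names `M`, `v` and `Mv` are illustrative, and the `@` operator (matrix multiplication) is an addition of ours that is not otherwise used in this section.

```python
import numpy as np

M = np.arange(6).reshape((2, 3))   # a matrix: two-dimensional array
v = np.array([1, 0, 2])            # a vector: one-dimensional array

print(M.ndim, M.shape)
print(v.ndim, v.shape)

# matrix-vector product: v is neither a "row" nor a "column" vector,
# and the result is again a one-dimensional array
Mv = M @ v
print(Mv, Mv.shape)
```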

Let’s learn a little more about the `np.array()` function:



In [None]:
%%R
np.array?#DONOTRUN

The help file reveals that the `np.array()` function takes a number of inputs, but
for now we focus on the first two: the
data and the data type or `dtype`.



In [None]:
%%R
x = np.array([[1, 2], [3, 4]])
x, x.dtype, x.ndim #CHOPLINE

In [None]:
%%R
np.array([[1, 2], [3.0, 4]]).dtype

In [None]:
%%R
np.array([[1, 2], [3, 4]], float).dtype

Arrays (i.e. `np.ndarray` objects) have several properties or
*attributes* and *methods*.
An *attribute* of an object
is a python object attached to another python object.
For instance `x.ndim` is the `ndim` attribute
of the object `x` which tells us we have
created a two dimensional array. A *method* of an object such as
`x` above is a function that takes `x` as its first
argument and is called as `x.methodname()`. For instance, the
`reshape` method returns a new array with the same data as
`x` but different shape. Users should be careful to note that
the data in `x` is not copied in this operation, so modifying
the reshaped array `x_reshape` will modify `x` as
well. When in doubt, one can use the `copy()` method.



In [None]:
%%R
x = np.array([1, 2, 3, 4])
print('beginning x', x)
x_reshape = x.reshape((2, 2))
print('reshaped x', x_reshape)
x_reshape[0, 0] = 5
print('reshaped and modified x', x_reshape)
print('after x', x)

Several things can be seen in this example. Displaying
`x_reshape` we see that `numpy` arrays are specified as a list
of *rows*. This is sometimes called *row-major ordering* rather than *column-major
ordering*.

We also see that `python` (and hence `numpy`) uses 0-based
indexing. That is, the entry `x[2]` would be `3`, the
third entry of `x`. Finally, we introduced a `tuple`
`(2, 2)` in our call to `reshape()`.  Like lists, tuples
are built-in `python` objects that represent a sequence of
objects. The reader may wonder why `python` has more than one way to
create a sequence. One of the key differences between lists and tuples
is *mutability*: entries of a tuple cannot be modified, while entries
of a list can be modified.



In [None]:
%%R
my_list = [3, 4, 5]
my_list[0] = 2
my_list

In [None]:
%%R
my_tuple = (3, 4, 5)
my_tuple[0] = 2

The `shape` of an array is an attribute that is always a tuple.
Other useful attributes are `ndim`, the number of dimensions, and `T`, the transpose of the array:



In [None]:
%%R
(x_reshape.shape, x_reshape.ndim, x_reshape.T) #CHOPLINE
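
The earlier remark that `reshape` shares data with the original array, while `copy()` does not, can be checked directly. The sketch below uses fresh names (`z`, `z_view`, `z_copy`) so as not to disturb `x`. The function `np.shares_memory()` is used only as a diagnostic and is not needed elsewhere in the lab.

```python
import numpy as np

z = np.array([1, 2, 3, 4])
z_view = z.reshape((2, 2))          # shares its data with z
z_copy = z.reshape((2, 2)).copy()   # independent copy of the data

z_copy[0, 0] = 5    # leaves z untouched
print('after modifying the copy', z)

z_view[0, 0] = 7    # modifies z as well
print('after modifying the view', z)

print(np.shares_memory(z, z_view), np.shares_memory(z, z_copy))
```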

We will often want to apply functions to arrays. Often these functions are to be applied elementwise:



In [None]:
%%R
np.sqrt(x)

In [None]:
%%R
x**2

In [None]:
%%R
x**0.5

In our simulations later, we will often want to generate random data
to evaluate how a learning method performs.
The `np.random.normal()` function generates a vector of random
normal variables. Looking at `np.random.normal?` we see that
it takes arguments `loc, scale, size`. The signature (i.e. argument list)
of this function reads `(loc=0.0, scale=1.0, size=None)`. The
arguments here are known as *keyword* arguments with default
values provided. Note that each
time we call this function, we will get a different answer. Here we
create two correlated sets of numbers, `x` and `y`, and use
the `np.corrcoef` function to compute their correlation matrix with the [0,1]
entry being the correlation.



In [None]:
%%R
x = np.random.normal(size=50)
y = x + np.random.normal(loc=50, scale=1, size=50)
np.corrcoef(x, y)

By default, `np.random.normal` creates standard normal
random variables with a mean of 0 and a standard deviation of 1.
However, the mean and standard deviation can be altered
using the `loc` and `scale` arguments, as illustrated
above. Sometimes we want our code to reproduce the exact same set of
random numbers; we can use the `np.random.seed` function to do
this. The `np.random.seed` function takes an (arbitrary)
integer argument.



In [None]:
%%R
np.random.seed(1303)
np.random.normal(size=3, scale=5) 

In [None]:
%%R
np.random.normal(size=3, scale=5) # these will be different 

In [None]:
%%R
np.random.seed(1303)     
np.random.normal(size=3, scale=5) # these will match above

We use `np.random.seed()` throughout the labs whenever we
perform calculations involving random quantities.  In general this
should allow the user to reproduce our results. However, it should be
noted that as new versions of `numpy` become available it is possible
that some small discrepancies may occur between the output
revealed in the book and the output
from `numpy`.
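
As an aside, recent versions of `numpy` also provide a generator object through `np.random.default_rng()`. This is not the approach used in this lab, which relies on `np.random.seed()`. The sketch below only illustrates that seeding a generator twice with the same integer reproduces the same draws. The names `rng`, `first` and `second` are ours.

```python
import numpy as np

# two generators created from the same seed
rng = np.random.default_rng(1303)
first = rng.normal(loc=0, scale=5, size=3)

rng = np.random.default_rng(1303)
second = rng.normal(loc=0, scale=5, size=3)

print(first)
print(np.array_equal(first, second))  # same seed, same draws
```

The numbers produced this way will generally differ from those produced by `np.random.seed(1303)` followed by `np.random.normal()`.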

The `np.mean()` and `np.var()` functions can be used
to compute the mean and variance of arrays.  These functions are also
available as methods on the arrays.



In [None]:
%%R
np.random.seed(3)
y = np.random.standard_normal(10)
np.mean(y), y.mean()
 

In [None]:
%%R
np.var(y), y.var(), np.mean((y - y.mean())**2)
 

In [None]:
%%R
np.sqrt(np.var(y)), np.std(y)
 

These functions can also be
computed along axes of an array if desired.
Let’s compute the mean of a matrix across
rows, which gives one mean per column. Rows are indexed by
the first axis, so we use `axis=0`.



In [None]:
%%R
X = np.random.standard_normal((10, 3))
X.mean(axis=0)
   

It is not necessary to specify the argument `axis` as a
`keyword` argument. We get the same result with the following code.



In [None]:
%%R
X.mean(0)
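
To see what the `axis` argument does, it can help to use a small array whose means are easy to compute by hand. The following is a minimal sketch; the names `Z`, `col_means` and `row_means` are illustrative.

```python
import numpy as np

Z = np.arange(12).reshape((4, 3))   # 4 rows, 3 columns

col_means = Z.mean(axis=0)   # averages over the rows: one value per column
row_means = Z.mean(axis=1)   # averages over the columns: one value per row

print(col_means.shape, col_means)
print(row_means.shape, row_means)
```

The axis passed to `mean()` is the one that disappears from the shape of the result.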

Note that `numpy` divides by $n$ in computing the variance and standard deviation
rather than by $n-1$, the denominator of the commonly used unbiased variance estimate. This can
be changed by using the argument `ddof`, which is subtracted from
the length of the array before dividing.



In [None]:
%%R
np.var(y, ddof=1) == np.sum((y - y.mean())**2) / (y.shape[0]-1)
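
The same `ddof` argument is accepted by `np.std()`. The sketch below uses a small array (`w`, a name of ours) whose variance can be worked out by hand, and compares floating point results with `np.isclose()` rather than `==`.

```python
import numpy as np

w = np.array([2.0, 4.0, 4.0, 6.0])
n = w.shape[0]

var_n = np.var(w)                  # divides by n
var_unbiased = np.var(w, ddof=1)   # divides by n - 1
std_unbiased = np.std(w, ddof=1)

# the two variance estimates differ by the factor n / (n - 1)
print(np.isclose(var_unbiased, var_n * n / (n - 1)))
print(np.isclose(std_unbiased, np.sqrt(var_unbiased)))
```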

### Graphics

Producing graphics is an important part of data analysis.
In `python` this is most commonly done through the library `matplotlib`.
As noted above, `python` was not written with data analysis in mind,
so the notion of plotting is not intrinsic to the language. This has the
effect that typical python objects do not necessarily have a natural way to be plotted.

#### Some preliminary details

We will want to be able to see plots in the notebook. We
will therefore include a cell that uses the `%matplotlib` magic, which tells
`jupyter` to include plots in the web frontend.

The line `%matplotlib` is known as a *magic* in `ipython`: it sets the behavior
of plotting windows so that interaction with the console is smoother.

The subpackage `pyplot`, usually imported under the name `plt`,
provides a simple interface into the more abstract and powerful library
`matplotlib`. For common data analysis tasks, the [seaborn](https://seaborn.pydata.org) library provides
a nice high level interface to `matplotlib`. In this introductory
chapter, we stick to `plt`. For many more examples,
students are encouraged to visit [matplotlib.org/stable/gallery](https://matplotlib.org/stable/gallery/index.html).

Let’s start with the
basic  `plt.plot()` function.
To find out more information
type `plt.plot?`.
Having loaded the library, let’s plot some data.



In [None]:
%%R
import matplotlib.pyplot as plt # common name for pyplot
x = np.random.standard_normal(100)
y = np.random.standard_normal(100)
plt.plot(x, y);

We see that by default, the function `plt.plot()` produces
line plots where a scatter plot would seem more natural here.
A scatter plot can be produced with `plt.scatter()` or by
specifying an additional argument to `plt.plot()`.
We will also want to add
labels and titles to our plot. This can be done by capturing the current plotting `axis`
with `plt.gca()` and setting the appropriate attributes. The current plotting figure
can be found with `plt.gcf()`. Note that we are modifying the labels of the plot below *after* having
constructed the initial plot.



In [None]:
%%R
plt.plot(x, y, 'o')
ax = plt.gca()
ax.set_xlabel("this is the x-axis") 
ax.set_ylabel("this is the y-axis") 
ax.set_title("Plot of X vs Y");  

Let’s compare this to a call to `plt.scatter()`.



In [None]:
%%R
plt.scatter(x, y, marker='o')
ax = plt.gca()
ax.set_xlabel("this is the x-axis") 
ax.set_ylabel("this is the y-axis") 
ax.set_title("Plot of X vs Y");  

We will often want to display more than one plot in a given figure. This
can be achieved using a sequence of *axes*. Let’s replot the
previous two figures side-by-side in a single figure.



In [None]:
%%R
fig, axes = plt.subplots(1, 2, figsize=(10, 6))
axes[0].plot(x, y, 'o')
axes[1].scatter(x, y, marker='+');
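
Each entry of the sequence returned by `plt.subplots()` is an axis like the one returned by `plt.gca()`, so labels and titles can be set on each panel separately. The following sketch uses its own data and its own names (`u`, `v`, `fig2`, `axes2`), so the figure `fig` above is left unchanged.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
u = rng.standard_normal(100)
v = rng.standard_normal(100)

fig2, axes2 = plt.subplots(1, 2, figsize=(10, 4))
axes2[0].plot(u, v, 'o')
axes2[0].set_title("plot() with 'o'")
axes2[1].scatter(u, v, marker='+')
axes2[1].set_title("scatter() with '+'")

# label both panels in a loop over the axes
for a in axes2:
    a.set_xlabel("u")
    a.set_ylabel("v")
```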

We will often want to save the output of a plot perhaps as a PNG or
PDF file. This can be done by saving the current figure.



In [None]:
%%R
fig.savefig("Figure.png", dpi=400)
fig.savefig("Figure.pdf", dpi=200);

Note that
`fig` was defined in the cell above the one in which it was saved. We can add to the
figure, or modify its axes and redisplay it.



In [None]:
%%R
axes[0].set_xlim([-1,1])
fig;

The function `np.linspace()` can be used to create a sequence
of numbers. For instance, `np.linspace(a, b, n)` makes a vector
of numbers starting at  `a` and  ending at `b` of length
`n`.  Another useful function is `np.arange` which is a
`numpy` version of the builtin function `range()`.
The reader may wonder why the terminal value 10 is not included in
`seq2` below: this is consistent with `python` slice notation
for sequences as seen in the slicing of the string in the snippet
below by the slice `2:5` which returns the third (`[2]`)
through fifth (`[4]`) character of “hello world” and does not
include the sixth character (`[5]`).



In [None]:
%%R
seq1 = np.linspace(0, 10, 11)
seq1
 

In [None]:
%%R
seq2 = np.arange(0, 10)
seq2

In [None]:
%%R
"hello world"[2:5]
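
The difference in endpoint handling between `np.linspace()` and `np.arange()` can be summarized in a few lines. This is only a sketch: the variable names are ours, and the `endpoint` argument of `np.linspace()` is not otherwise used in this lab.

```python
import numpy as np

closed = np.linspace(0, 10, 11)    # 11 points, includes the endpoint 10
half_open = np.arange(0, 10)       # stops before 10, like range() and slices
with_end = np.arange(0, 11)        # to include 10, stop one past it

# linspace can also exclude its endpoint
no_endpoint = np.linspace(0, 10, 10, endpoint=False)

print(closed[-1], half_open[-1], with_end[-1])
print(np.allclose(no_endpoint, half_open))
```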

We will now create some more sophisticated plots. The
`plt.contour()` function produces a *contour plot*
in order to represent three-dimensional data; it is like a
topographical map.  It takes three arguments:


* A vector of the `x` values (the first dimension),


* A vector of the `y` values (the second dimension), and


* A matrix whose elements correspond to the `z` value (the third
dimension) for each pair of `(x,y)` coordinates.


As with the `plt.scatter()` function, there are many other
inputs that can be used to fine-tune the output of the
`plt.contour()` function. To learn more about these, take a
look at the help file by typing `?plt.contour`.



In [None]:
%%R
x = np.linspace(-np.pi, np.pi, 50)
y = x
f = np.multiply.outer(np.cos(y), 1 / (1 + x**2))
plt.contour(x, y, f);

In [None]:
%%R
plt.contour(x, y, f, levels=45);

In [None]:
%%R
fa = (f - f.T) / 2
plt.contour(x, y, fa, levels=15);
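
The matrix `f` above was built with `np.multiply.outer()`. As a check on what that call produces, the sketch below rebuilds the same matrix under different names (`xs`, `ys`, `f_outer`, `f_bcast`) and compares it with an equivalent construction that adds length-one axes using `None` in the index, an alternative we introduce here only for comparison.

```python
import numpy as np

xs = np.linspace(-np.pi, np.pi, 50)
ys = xs

# outer product: entry [i, j] is cos(ys[i]) * 1 / (1 + xs[j]**2)
f_outer = np.multiply.outer(np.cos(ys), 1 / (1 + xs**2))

# the same matrix from a column of shape (50, 1) times a row of shape (1, 50)
f_bcast = np.cos(ys)[:, None] * (1 / (1 + xs**2))[None, :]

print(f_outer.shape)
print(np.allclose(f_outer, f_bcast))
print(np.isclose(f_outer[3, 7], np.cos(ys[3]) / (1 + xs[7]**2)))
```

Rows of the matrix therefore correspond to the `y` values and columns to the `x` values, which is the layout `plt.contour()` expects.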

The `plt.imshow()` function works the same way as
`plt.contour()`, except that it produces a color-coded plot
whose colors depend on the `z` value. This is known as a
heatmap, and is sometimes used to plot temperature in
weather forecasts.



In [None]:
%%R
plt.imshow(fa);

### Indexing Data

We often wish to examine part of a set of data. Suppose that our data
is stored in the matrix `A` (recall that we defined a matrix
as a 2-dimensional `numpy` array for the purpose of these labs).



In [None]:
%%R
A = np.array(np.arange(16)).reshape((4, 4))
A #CHOPLINE

Then, typing `A[1,2]` will select the element corresponding to the second row and the third
column.



In [None]:
%%R
A[1, 2]

The first number after the open-bracket symbol `[`
always refers to the row (starting with 0 as the first row), and the
second number always refers to the column (again starting with 0). We
can also select multiple rows and columns at a time, by providing
vectors as the indices, though `numpy`’s indexing notation can get a
little complex. Let’s look at a few indexing operations. We may
want to select only a few rows, perhaps the 2nd and 4th rows:



In [None]:
%%R
A[[1,3]]  #CHOPLINE

Suppose now we want to select the 1st and 3rd columns. This
is selecting on the 2nd axis of the array. To indicate
this, we pass `[0,2]` as the second argument in the square brackets.
The first argument `:` indicates we will
select all rows.



In [None]:
%%R
A[:,[0,2]]  #CHOPLINE

Suppose we want to select the submatrix made up of the 2nd and 4th
rows as well as the 1st and 3rd columns. This is where
indexing gets slightly tricky. The following may seem a natural guess
at how to accomplish this



In [None]:
%%R
A[[1,3],[0,2]]

We see this has given us a one-dimensional array of length 2 identical to



In [None]:
%%R
np.array([A[1,0],A[3,2]])

It will now not be surprising that the following code will not
extract the 2nd and 4th rows as well as the 1st, 3rd and 4th columns:



In [None]:
%%R
A[[1,3],[0,2,3]]
	      

One can also `slice` arrays. To retrieve the 2nd and 3rd rows of
`A` we can use the following:



In [None]:
%%R
A[1:3] #CHOPLINE

If we pass two slices to the square brackets then we see that `numpy`
does retrieve submatrices.



In [None]:
%%R
A[1:3,0:2] #CHOPLINE

Why the difference? The notation `1:3` is equivalent to `slice(1,3,1)`
which is an object that represents the indices greater than or equal to 1, but less than 3,
spaced apart by steps of size 1 (the object `slice(4,20,5)`
would represent $\{j: j=4+5i,\ i=0,1,2,3\} = \{4, 9, 14, 19\}$). Informally this slice
`1:3` is the same
as just the list `[1,2]` but they are different python types and
are treated differently by `numpy`.
Slices can be used to extract objects from arbitrary sequences (strings, lists, tuples).
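
The following sketch illustrates these points with explicit `slice` objects. The array `B` and the other names are ours, chosen so that `A` is left untouched.

```python
import numpy as np

# a slice object can be used wherever the start:stop notation can
letters = "hello world"
print(letters[slice(2, 5)] == letters[2:5])

# the indices represented by slice(4, 20, 5)
idx = list(range(20))[slice(4, 20, 5)]
print(idx)

B = np.arange(16).reshape((4, 4))
sub = B[slice(1, 3), slice(0, 2)]   # same as B[1:3, 0:2]: a 2 x 2 submatrix
pairs = B[[1, 2], [0, 1]]           # lists instead: the entries B[1,0] and B[2,1]
print(sub)
print(pairs)
```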

To extract submatrices of a given set of rows and a given set of columns one can
first create a smaller matrix by subsetting the rows, and then subsetting the columns.



In [None]:
%%R
tmp = A[[1,3]]
method1 = tmp[:,[0,2]]
method1 #CHOPLINE

In [None]:
%%R
method2 = A[[1,3]][:,[0,2]]
method1 - method2 #CHOPLINE

For large matrices, this method unnecessarily creates a temporary
matrix `tmp` (even for the `method2` example). This can be
avoided by using a convenience function `np.ix_` to help
specify the indices.



In [None]:
%%R
idx = np.ix_([1,3],[0,2])
A[idx] #CHOPLINE
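
It can be instructive to look at what `np.ix_` returns. In the sketch below (with illustrative names `B`, `rows`, `cols`), the result is a pair of arrays whose shapes are arranged so that `numpy` combines every requested row with every requested column.

```python
import numpy as np

B = np.arange(16).reshape((4, 4))
rows, cols = np.ix_([1, 3], [0, 2])

# rows is a column of shape (2, 1), cols is a row of shape (1, 2)
print(rows.shape, cols.shape)

sub = B[rows, cols]
print(sub)

# agrees with subsetting the rows first and then the columns
print(np.array_equal(sub, B[[1, 3]][:, [0, 2]]))
```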

We can also select or drop rows with arrays filled with
booleans `True` or `False`.



In [None]:
%%R
keep_rows = np.zeros(A.shape[0], bool)
keep_rows[[1,3]] = True
keep_cols = np.zeros(A.shape[1], bool)
keep_cols[[0, 2]] = True
idx_bool = np.ix_(keep_rows, keep_cols)
A[idx_bool] #CHOPLINE
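
In practice, boolean arrays are often produced by a comparison rather than filled in by hand. The sketch below, using an illustrative array `B` and a threshold chosen arbitrarily, keeps the rows whose first entry is at least 8.

```python
import numpy as np

B = np.arange(16).reshape((4, 4))

# one boolean per row: is the first entry of the row at least 8?
mask = B[:, 0] >= 8
print(mask)

# a boolean array in the first position selects the matching rows
print(B[mask])
```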

For more details on indexing in `numpy`, students are referred
to the numpy tutorial above.

### Loading Data

For most analyses, the first step involves importing a data set into
`python`.  Readers will have noticed that the matrices above
(2-dimensional arrays) did not carry any names related to either rows
or columns, and all data within the array was of the same type.
Often, data sets have different types of data. This requires a
different object to store this type of data, typically called a
data frame. In `python`, the `pandas`
library provides a data frame object.

Data frames can be read in from
a file using functions provided by `pandas`. Before attempting to load
a data set, we must make sure that `python` knows to search for the
data in the proper directory or give a full path name for the
file. The working directory can be set with the command
`os.chdir`. Make sure that the
files `Auto.csv` and `Auto.data` (available
on the book’s website) are in the same location
as this notebook file.
Alternatively, we can give a full or absolute path
to the file.

For instance, a
comma-separated values (CSV) file can be read as follows:



In [None]:
%%R
import pandas as pd
Auto = pd.read_csv('Auto.csv')

A data frame
is perhaps most easily understood as a sequence
of arrays of identical length which we typically
think of as *columns*. Entries in the
different arrays can be combined to form a *row* corresponding
to a subject’s data. The columns are named (as are the rows) and can be accessed
using the column names:



In [None]:
%%R
Auto['horsepower'] #SUPPRESSOUTPUT

We have another version of this data called `Auto.data` that is whitespace
delimited. This can be read in as follows:



In [None]:
%%R
Auto = pd.read_csv('Auto.data', sep=r'\s+')

Note that `Auto.csv` and `Auto.data` are simply text
files, which you could alternatively open on your computer using a
standard text editor. It is often a good idea to view a data set using
a text editor or other software such as Excel before loading it into
`python`.

This particular data  set has not been loaded correctly, as missing values have been encoded as
`?` in the text file. We can see this by inspecting some of the variables:



In [None]:
%%R
np.unique(Auto['horsepower']) #SUPPRESSOUTPUT

This can be fixed by giving `pd.read_csv()` an argument called `na_values`:



In [None]:
%%R
Auto = pd.read_csv('Auto.data',
                   na_values=['?'],
                   sep=r'\s+')
np.unique(Auto['horsepower'][:4]) 
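
The effect of `na_values` can be reproduced without any file on disk. The sketch below reads a small made-up table (toy values of ours, not taken from `Auto.data`) from a string using `io.StringIO`, once without and once with `na_values`.

```python
import io
import pandas as pd

# toy whitespace-delimited data with a missing value coded as '?'
text = """mpg horsepower name
18.0 130 car_a
25.0 ? car_b
30.0 95 car_c
"""

raw = pd.read_csv(io.StringIO(text), sep=r'\s+')
clean = pd.read_csv(io.StringIO(text), sep=r'\s+', na_values=['?'])

# without na_values the column is read as text; with it, as numbers
print(raw['horsepower'].dtype, clean['horsepower'].dtype)
print(clean['horsepower'].isna().sum())
```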

The `Auto.shape` attribute tells us that the data has 397
observations, or rows, and nine variables, or columns.  There are
various ways to deal with the missing data. 
In this case, only five of the rows contain missing
observations, and so we choose to use the `dropna()`
method to simply remove these rows.



In [None]:
%%R
Auto_new = Auto.dropna()
Auto.shape, Auto_new.shape

Once the data are loaded correctly, we can use `Auto.columns` to check the variable names.



In [None]:
%%R
Auto = Auto_new	      
Auto.columns #CHOPLINE

Accessing data from a data frame is similar to a matrix but not identical. Data frames
maintain identifiers for rows in their `index` attribute and names for columns
in their `columns` attribute. We can therefore access data by *name* from a data frame.
The primary index of a data frame is its columns (so data frames are similar to column-major
matrices, but the type of each entry need not be identical as in an array).



In [None]:
%%R
Auto[['mpg', 'horsepower']][:3]

We did not specify an index column when we loaded our data frame. We can do that using the
`set_index()` method.



In [None]:
%%R
Auto_re = Auto.set_index('name')
Auto_re #SUPPRESSOUTPUT 

We can now access rows of the data frame using the `loc[]` indexer of
`Auto_re`:



In [None]:
%%R
rows = ['amc rebel sst',
        'ford torino']
Auto_re.loc[rows] #SUPPRESSOUTPUT

It is sometimes still of interest to access rows of the data frame using
numeric indices. The rows above are the 4th and 5th row of `Auto`. We
can retrieve these using the `iloc[]` indexer:



In [None]:
%%R
Auto_re.iloc[[3,4]] #SUPPRESSOUTPUT
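
The difference between `loc[]` and `iloc[]` is easiest to see on a very small data frame. The one below is made up for illustration (its names and values are not from the `Auto` data).

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                   'mpg': [18.0, 25.0, 30.0, 22.0],
                   'year': [70, 71, 72, 73]})
df_re = df.set_index('name')

by_label = df_re.loc[['b', 'd']]     # select by index label
by_position = df_re.iloc[[1, 3]]     # select by position: 2nd and 4th rows

print(by_label)
print(by_label.equals(by_position))
```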

Index entries need not be unique: there are several cars with `name == 'ford galaxie 500'`
but different years.



In [None]:
%%R
Auto_re.loc['ford galaxie 500', ['mpg', 'origin']]	      

To ensure unique indices here we can create a multidimensional index.



In [None]:
%%R
Auto_m = Auto.set_index(['name', 'year'])
Auto_m.loc[('ford galaxie 500', 70), ['mpg', 'horsepower']]

Rows can also be selected with an anonymous function called a
`lambda`:



In [None]:
%%R
Auto_re.loc[lambda df: df['year'] > 80,
            ['weight', 'origin']][:5]
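
A `lambda` passed to `loc[]` is called with the data frame itself and should return something that `loc[]` accepts, here a boolean series. The sketch below, on a small made-up data frame, checks that this gives the same rows as computing the boolean series first.

```python
import pandas as pd

df = pd.DataFrame({'year': [79, 81, 82, 80],
                   'weight': [2500, 2100, 2300, 2700]},
                  index=['w', 'x', 'y', 'z'])

with_lambda = df.loc[lambda d: d['year'] > 80, ['weight']]

mask = df['year'] > 80               # the boolean series, computed explicitly
with_mask = df.loc[mask, ['weight']]

print(with_lambda)
print(with_lambda.equals(with_mask))
```

The `lambda` form is convenient when the data frame being indexed has no name of its own, for instance in the middle of a chain of operations.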

### Additional Graphical and Numerical Summaries

We can use the `plt.scatter()` function to produce *scatterplots* 
of the quantitative variables. However, simply typing the variable names will produce an error message,
because `python` does not know to look in the `Auto` data set for those variables.



In [None]:
%%R
plt.plot(cylinders, mpg, 'o') #SUPPRESSOUTPUT

To refer to the variable `cylinders` we can use `Auto.cylinders` or `Auto['cylinders']`.
Alternatively, we can plot data from the data frame using the `Auto.plot.scatter()` method:



In [None]:
%%R
plt.scatter(Auto.horsepower, Auto.mpg) #SUPPRESSOUTPUT
Auto.plot.scatter('horsepower', 'mpg') #SUPPRESSOUTPUT

The `cylinders` variable is stored as a numeric vector, so `pandas` has treated it as quantitative.
However, since there are only a small number of possible values for `cylinders`,
one may prefer to treat it as a qualitative variable. The `dtype='category'` argument
converts quantitative variables into qualitative variables.



In [None]:
%%R
Auto.cylinders = pd.Series(Auto.cylinders,
                           dtype='category')
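
What the conversion does can be inspected on a short series. In this sketch the values are made up, and we use the `astype('category')` method, which is an alternative of ours to the `pd.Series(..., dtype='category')` construction above.

```python
import pandas as pd

cyl = pd.Series([4, 8, 6, 4, 8])
cyl_cat = cyl.astype('category')

print(cyl.dtype, cyl_cat.dtype)

# the distinct values become the categories
print(list(cyl_cat.cat.categories))

# number of observations in each category
print(cyl_cat.value_counts().sort_index())
```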

If the variable plotted on the $x$-axis is categorical, then  *boxplots* 
can be produced using the `boxplot()` method.



In [None]:
%%R
Auto.boxplot('mpg', by='cylinders');

The `Auto.hist()` method can be used to plot a *histogram*.
The `color` argument sets the color of the bars and the `bins` argument sets the number of bins.



In [None]:
%%R
Auto.hist('mpg');

In [None]:
%%R
Auto.hist('mpg', color='red');

In [None]:
%%R
Auto.hist('mpg', bins=12); 

We often want to visualize pairwise relationships between variables in
a data frame. The
`pd.plotting.scatter_matrix()` function creates a
*scatterplot matrix*, i.e. a scatterplot for every pair of continuous
variables for any given data set.  We can also produce scatterplots
for just a subset of the variables.



In [None]:
%%R
pd.plotting.scatter_matrix(Auto);

In [None]:
%%R
pd.plotting.scatter_matrix(Auto[['mpg',
                                 'displacement',
                                 'weight']]);

The `describe()` method produces a numerical summary of each variable in a particular data set.



In [None]:
%%R
Auto[['mpg', 'weight']].describe() #CHOPLINE

We can also produce a summary of just a single variable.



In [None]:
%%R
Auto['cylinders'].describe()
Auto['mpg'].describe() #CHOPLINE

Once we have finished using `Jupyter`, we can select `File->Close and Halt`.