# Introduction to Python

In this lab, we will introduce some simple Python commands. The best way to learn a new language is to try out the commands. Python can be downloaded from https://www.anaconda.com/download/.

## Basic Commands

Python uses functions to perform operations. To run a function called `funcname`, we type `funcname(input1, input2)`, where the inputs (or _arguments_) `input1` and `input2` tell Python how to run the function. A function can have any number of inputs. For example, to create a vector of numbers, we use the function `np.array()` from the module `numpy`. To do this, we pass a `list` as an argument to the function; lists are a built-in data type in Python. Here we assume you are already familiar with basic Python objects.

In [None]:
# Start by importing the module numpy
import numpy as np

# Define a new vector x
x = np.array([1, 6, 2])

# Make Jupyter show us the vector x; this only works interactively, not in scripts
x

Typing `?funcname` will cause Jupyter to pop up a help panel with additional information about the function `funcname`.

In [None]:
?np.array

We now make a second vector `y`.

In [None]:
y = np.array([1, 4, 3])

We can tell Python to add two sets of numbers together. It will then add the first number from `x` to the first number from `y`, and so on. However, `x` and `y` should be the same length. We can check their length using the `len()` function.

In [None]:
len(x)

In [None]:
len(y)

In [None]:
x + y
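As a small illustration of the length requirement, the sketch below adds two equal-length arrays element by element and then tries the same operation with arrays of different lengths. The names `a`, `b`, `c`, `total`, and `mismatch_error` are made up for this example.

```python
import numpy as np

# Hypothetical example arrays (not part of the lab's data)
a = np.array([1, 6, 2])
b = np.array([1, 4, 3])
c = np.array([10, 20])  # deliberately a different length

# Equal lengths: the sum is computed element by element
total = a + b

# Lengths 3 and 2 cannot be combined, so NumPy raises a ValueError
try:
    a + c
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)

print(total)
print(mismatch_error)
```

One exception worth knowing about: an array of length one is broadcast across the other operand, so `a + np.array([10])` adds 10 to every element of `a` rather than raising an error.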

The `%whos` "magic" command allows us to look at a list of all the objects, such as data and functions, that we have saved so far.

In [None]:
%whos

The `del` statement can be used to delete any objects that we don't want.

In [None]:
del x

In [None]:
%whos

It's also possible to remove all objects at once.

In [None]:
%reset

In [None]:
%whos

Let's import `numpy` again, since `%reset` also removed the name `np`.

In [None]:
import numpy as np

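The `np.matrix()` function can be used to create a matrix of numbers. Before we use the `np.matrix()` function we can learn more about it. Note that the NumPy documentation no longer recommends this class for new code; a two-dimensional `np.array` built from the same list of lists is the usual alternative.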

In [None]:
?np.matrix

The help reveals that the `np.matrix()` function can take a number of inputs, but for now we focus on how to build a simple matrix. To build a matrix, we can pass a list of lists as the argument. Each inner list is a row.

In [None]:
x = np.matrix([[1, 3], [2, 4]])
x

If the first argument (`data`) is a string, it is interpreted as a matrix with commas or spaces separating columns, and semicolons separating rows.

In [None]:
x = np.matrix('1, 3; 2, 4')
x

The `np.sqrt()` function returns the square root of each element of a vector or a matrix. The function `np.power(x, 2)` raises each element of `x` to the power of 2; any powers are possible, including fractional or negative powers.

In [None]:
np.sqrt(x)

In [None]:
np.power(x, 2)
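To make the remark about fractional and negative powers concrete, here is a minimal sketch. The array `v` is a made-up example; it is stored as floats because NumPy refuses to raise integer arrays to negative integer powers.

```python
import numpy as np

# Hypothetical example values, stored as floats
v = np.array([1.0, 4.0, 9.0])

# A fractional power of 0.5 matches np.sqrt()
half_power = np.power(v, 0.5)

# A negative power of -1 gives the element-wise reciprocal
reciprocal = np.power(v, -1.0)

print(half_power)
print(reciprocal)
```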

The `np.random.normal()` function generates a vector of random normal variables; its third argument, `size`, is the sample size. Each time we call this function, we will get a different answer. Here we create two correlated sets of numbers, `x` and `y`, and use the `np.corrcoef()` function to compute the correlation matrix between them.

In [None]:
x = np.random.normal(loc=0, scale=1, size=50)
y = x + np.random.normal(50, 0.1, 50)
np.corrcoef(x, y)
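The output of `np.corrcoef()` for two vectors is a 2-by-2 matrix rather than a single number. The sketch below, which uses made-up names `u`, `w`, `corr`, and `r` and an arbitrary seed, shows one way to pull out the correlation itself.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, chosen only for this illustration

# Hypothetical data: w is u shifted by about 50 plus a little noise
u = np.random.normal(loc=0, scale=1, size=50)
w = u + np.random.normal(loc=50, scale=0.1, size=50)

corr = np.corrcoef(u, w)

# The diagonal holds the correlation of each vector with itself (1.0);
# the off-diagonal entries both hold the correlation between u and w
r = corr[0, 1]

print(corr)
print(r)
```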

By default, `np.random.normal()` creates standard normal random variables with a mean of 0 and a standard deviation of 1. However, the mean and standard deviation can be altered using the `loc` and `scale` arguments, as illustrated above. Sometimes we want our code to reproduce the exact same set of random numbers; we can use the `np.random.seed()` function to do this. The `np.random.seed()` function takes an (arbitrary) integer argument.

In [None]:
np.random.seed(1303)
np.random.normal(size=50)

We use `np.random.seed()` throughout the labs whenever we perform calculations involving random quantities in order to obtain reproducible results.
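A quick way to convince yourself that seeding works is to draw twice with the same seed and compare the results. This is only a sketch; the names `first`, `second`, and `third` are made up for the illustration.

```python
import numpy as np

# Same seed, same draws
np.random.seed(1303)
first = np.random.normal(size=5)

np.random.seed(1303)
second = np.random.normal(size=5)

# Without reseeding, the generator continues and produces new numbers
third = np.random.normal(size=5)

print(np.array_equal(first, second))
print(np.array_equal(second, third))
```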

The `np.mean()` and `np.var()` functions can be used to compute the mean and variance of a vector of numbers. Applying `np.sqrt()` to the output of `np.var()` will give the standard deviation (or we can use `np.std()`).

In [None]:
np.random.seed(3)
y = np.random.normal(size=100)
np.mean(y)

In [None]:
np.var(y)

In [None]:
np.std(y)
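One detail that is easy to miss: by default `np.var()` and `np.std()` divide by `n` (the `ddof=0` setting), whereas many statistics packages divide by `n - 1`. The sketch below uses a small made-up array, `values`, to show both conventions.

```python
import numpy as np

# Hypothetical data with mean 5 and squared deviations summing to 32
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Default (ddof=0): divide by n
pop_var = np.var(values)
pop_std = np.std(values)

# ddof=1: divide by n - 1, the usual sample variance
sample_var = np.var(values, ddof=1)

print(pop_var, pop_std, sample_var)
```

For large samples the two conventions give nearly identical answers, but the difference is noticeable for small ones.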

## Graphics

The `plt.plot()` function is the primary way to plot data in Python. For instance, `plt.plot(x, y, 'o')` produces a scatterplot of the numbers in `x` versus the numbers in `y`. There are many additional options that can be passed in to the `plt.plot()` function, and many other functions that alter the appearance of the plot. For example, the `plt.xlabel()` function adds a label to the x-axis. To find out more information about the `plt.plot()` function, type `?plt.plot`.

In [None]:
# First import the matplotlib module
import matplotlib.pyplot as plt
# To show plot in Jupyter we need the following magic command
%matplotlib inline


x = np.random.normal(size=100)
y = np.random.normal(size=100)
plt.plot(x, y, 'o')

In [None]:
plt.plot(x, y, 'o')
plt.xlabel('This is the x-axis')
plt.ylabel('This is the y-axis')
plt.title('Plot of X vs Y')

We will often want to save the output of a Python plot. We do this with the `plt.savefig()` function. We can choose the type of format to output by changing the extension of the file name. For instance, to create a pdf, we use `plt.savefig('output.pdf')`, and to create a jpeg, we use `plt.savefig('output.jpeg')`.

In [None]:
plt.plot(x, y, 'o')
plt.xlabel('This is the x-axis')
plt.ylabel('This is the y-axis')
plt.title('Plot of X vs Y')
plt.savefig('Figure.pdf')

The function `np.arange` can be used to create a sequence of numbers. For instance, `np.arange(a, b)` makes a vector of integers between `a` and `b`, excluding `b`. There are other functions: for instance, `np.linspace(a, b, n)` makes a sequence of `n` numbers that are equally spaced between `a` and `b`.

In [None]:
np.arange(1, 11)

In [None]:
np.arange(1, 11, 2)

In [None]:
x = np.linspace(-np.pi, np.pi, 50)
x
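The two functions treat their endpoints differently, which the following sketch makes explicit. The names `ints`, `odds`, and `grid` are made up for the example.

```python
import numpy as np

# np.arange excludes the stop value: this gives 1, 2, ..., 10
ints = np.arange(1, 11)

# An optional third argument sets the step size
odds = np.arange(1, 11, 2)

# np.linspace includes both endpoints by default
grid = np.linspace(0.0, 1.0, 5)

print(ints)
print(odds)
print(grid)
```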

We will now create some more sophisticated plots. The `plt.contour()` function produces a contour plot in order to represent three-dimensional data; it is like a topographical map. It takes three arguments:
1. A vector of the `x` values (the first dimension),
2. A vector of the `y` values (the second dimension), and
3. A matrix of the `z` values (the third dimension) for each pair of (`x`, `y`) coordinates.

As with the `plt.plot()` function, there are many other inputs that can be used to fine-tune the output of the `plt.contour()` function. To learn more about these, take a look at the help file by typing `?plt.contour`.

In [None]:
y = x
f = np.matrix([[np.cos(j)/(1 + i**2) for j in y] for i in x])
plt.contour(x, y, f)

In [None]:
fa = (f - f.T)/2
plt.contour(x, y, fa, 15)
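As an alternative to nested list comprehensions, the grid of `z` values can be built with `np.meshgrid()` and ordinary array arithmetic. This is a sketch, not the lab's own method; the names `X`, `Y`, `Z`, and `f_loops` are introduced here for illustration. Matplotlib documents the `z` argument of `plt.contour()` as having one row per `y` value and one column per `x` value, which is the layout `np.meshgrid()` produces.

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 50)
y = x

# X[j, i] == x[i] and Y[j, i] == y[j]
X, Y = np.meshgrid(x, y)

# Vectorized evaluation: Z[j, i] = cos(y[j]) / (1 + x[i]**2)
Z = np.cos(Y) / (1 + X**2)

# The nested-list construction, with rows indexed by x, for comparison
f_loops = np.array([[np.cos(j) / (1 + i**2) for j in y] for i in x])

# The two grids are transposes of each other
print(np.allclose(Z, f_loops.T))
```

A grid built this way can be passed to `plt.contour(x, y, Z)`. Because the nested-list version is indexed with `x` along the rows, it is the transpose of `Z`, so the two plots differ by a swap of the axes.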

The `plt.contourf()` function works the same way as `plt.contour()`, except that it produces a color-coded plot whose colors depend on the `z` value. This is known as a heatmap, and is sometimes used to plot temperature in weather forecasts. Alternatively, `plt.` can be used to produce a three-dimensional plot.

In [None]:
plt.contourf(x, y, fa, 50)

In [None]:
from mpl_toolkits.mplot3d import axes3d

X, Y = np.meshgrid(x, y)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.view_init(30, 0)
ax.plot_wireframe(X, Y, fa)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.view_init(30, 20)
ax.plot_wireframe(X, Y, fa)

## Indexing Data

We often wish to examine part of a set of data. Suppose that our data is stored in the matrix `A`.

In [None]:
A = np.arange(16).reshape(4,4)
A = A.T
A

Then, typing

In [None]:
A[1, 2]

will select the element in the second row and the third column, because Python indices start at 0. The first number after the open-bracket symbol `[` always refers to the row, and the second number always refers to the column. We can also select multiple rows and columns at a time by using the `np.ix_()` function.

In [None]:
A[np.ix_([0,2],[1,3])]

In [None]:
A[0:3, 1:4]

In [None]:
A[:,0:2]

In [None]:
A[0:2, :]

The last two examples use a bare `:` in place of either the row index or the column index. This indicates that Python should include all rows or all columns, respectively. Python treats a single row or column of a matrix as a vector.

In [None]:
A[0]
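The difference between `np.ix_()` and passing two plain lists is a common source of confusion, so here is a short sketch using the same `A` as above. The result names `single`, `block`, `pairs`, and `first_row` are made up for the example.

```python
import numpy as np

A = np.arange(16).reshape(4, 4).T

# Zero-based indexing: row index 1, column index 2
single = A[1, 2]

# np.ix_ selects the sub-matrix formed by rows 0, 2 and columns 1, 3
block = A[np.ix_([0, 2], [1, 3])]

# Two plain lists are paired up instead: elements (0, 1) and (2, 3)
pairs = A[[0, 2], [1, 3]]

# A single index returns the whole first row as a one-dimensional array
first_row = A[0]

print(single)
print(block)
print(pairs)
print(first_row)
```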

The `np.shape()` function outputs the number of rows followed by the number of columns of a given matrix.

In [None]:
np.shape(A)

## Loading Data

For most analyses, the first step involves importing a data set into Python. The `pd.read_table()` function is one of the primary ways to do this. The help file contains details about how to use this function. We can export data using the `.to_{format}` method.

Before attempting to load a data set, we must make sure that Python knows to search for the data in the proper directory. We can do this by specifying the path in the argument of the `pd.read_table()` function. We begin by loading in the Auto data set. The following command will load the `Auto.csv` file into Python and store it as an object called `Auto`, in a format referred to as a `data frame`.

In [None]:
# First import the pandas module.
import pandas as pd

In [None]:
Auto = pd.read_table('Data/Auto.csv', delimiter=',')
Auto
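The `.to_{format}` family mentioned above includes, for example, `to_csv()`. The sketch below writes a tiny made-up data frame (`cars`, not the real Auto data) to an in-memory buffer instead of a file, so it can be run without touching the disk, and then reads it back.

```python
import io

import pandas as pd

# Hypothetical two-row data frame used only for this illustration
cars = pd.DataFrame({'mpg': [18.0, 15.0], 'cylinders': [8, 8]})

# Export to CSV text; index=False leaves out the row labels
buffer = io.StringIO()
cars.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the text back in to check the round trip
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored.equals(cars))
```

Passing a file name such as `'cars.csv'` instead of the buffer would write the same text to disk.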

Note that `Auto.csv` is simply a text file, which you could alternatively open on your computer using a standard text editor. It is often a good idea to view a data set using a text editor or other software such as Excel before loading it into Python.

The data set also includes a number of missing observations, indicated by a question mark '?'. Missing values are a common occurrence in real data sets.

In [None]:
Auto[Auto.values == '?']

In [None]:
Auto[Auto.isnull().values]

One option is to replace these values with `NaN`.

In [None]:
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
Auto = Auto.replace('?', np.nan)

In [None]:
Auto[Auto.isnull().values]

Another way is to use the option `na_values`, which tells Python that any time it sees a particular character or set of characters (such as a question mark), it should be treated as a missing element of the data matrix.

In [None]:
Auto = pd.read_table('Data/Auto.csv', delimiter=',', na_values=['?'])

In [None]:
Auto[Auto.isnull().values]

In [None]:
Auto.shape

The `shape` attribute tells us that the data has 397 observations, or rows, and nine variables, or columns. There are various ways to deal with the missing data. In this case, only five of the rows contain missing observations, so we use the `dropna()` method to simply remove these rows.

In [None]:
Auto = Auto.dropna()

In [None]:
Auto.shape
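The effect of `na_values` and `dropna()` can be seen on a tiny in-memory example. The CSV text below is made up for this sketch and is not the real Auto data; `pd.read_csv()` is used here because it assumes a comma delimiter by default.

```python
import io

import pandas as pd

# Hypothetical CSV text with one missing value marked by '?'
raw = "mpg,horsepower\n18.0,130\n25.0,?\n22.0,95\n"

# Without na_values, the '?' forces the whole column to be read as text
as_text = pd.read_csv(io.StringIO(raw))

# With na_values, the '?' becomes NaN and the column stays numeric
as_missing = pd.read_csv(io.StringIO(raw), na_values=['?'])

# dropna() removes every row that contains at least one NaN
cleaned = as_missing.dropna()

print(as_text['horsepower'].tolist())
print(as_missing['horsepower'].isnull().sum())
print(cleaned.shape)
```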

Once the data are loaded correctly, we can use the `columns` attribute to check the variable names.

In [None]:
Auto.columns

## Additional Graphical and Numerical Summaries

We can use the `plot()` method of a data frame to produce scatterplots of the quantitative variables. However, simply typing the variable names on their own will produce an error message, because Python does not know to look in the `Auto` data set for those variables.

To refer to a variable, we must type the data set and the variable name joined with a `.` symbol.
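As a brief sketch of the two common ways to refer to a column, consider the small made-up data frame `cars` below (a stand-in, not the real Auto data). Dot access and bracket access return the same column.

```python
import pandas as pd

# Hypothetical stand-in for the Auto data
cars = pd.DataFrame({'mpg': [18.0, 15.0, 36.0], 'cylinders': [8, 8, 4]})

# Dot access: data set and variable name joined with a '.'
by_dot = cars.mpg

# Bracket access with the column name as a string
by_key = cars['mpg']

print(by_dot.equals(by_key))
```

Bracket access is the safer choice when a column name contains spaces or clashes with an existing data frame method or attribute.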

In [None]:
# Keyword arguments make the roles of the three inputs explicit
Auto.plot(x='cylinders', y='mpg', kind='scatter')

In [None]:
# Boxplots of mpg, grouped by the number of cylinders
Auto.boxplot(column='mpg', by='cylinders')

In [None]:
Auto.hist('mpg', color='r', bins=15)

In [None]:
_ = pd.plotting.scatter_matrix(Auto, figsize=(15, 15))

In [None]:
_ = pd.plotting.scatter_matrix(Auto[['mpg', 'displacement', 'horsepower', 'weight', 'acceleration']], figsize=(15, 15))

In [None]:
ax = Auto.plot.scatter(x='horsepower', y='mpg', figsize=(15, 15))
for i, txt in enumerate(Auto.mpg):
    ax.annotate(txt, (Auto.horsepower.iat[i],Auto.mpg.iat[i]))

The `describe()` method produces a numerical summary of each variable in a particular data set.

In [None]:
Auto.describe(include='all')

In [None]:
Auto.name.describe()

In [None]:
Auto.mpg.describe()
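To see what `describe()` reports for a numeric variable, here is a minimal sketch on a three-value made-up series (`mpg`, not the real Auto column).

```python
import pandas as pd

# Hypothetical values used only for this illustration
mpg = pd.Series([18.0, 15.0, 36.0], name='mpg')

summary = mpg.describe()

# The summary is itself a Series indexed by statistic name
print(summary)
print(summary['mean'], summary['50%'])
```

Individual statistics can be picked out by label, as in the last line, which is convenient when only one or two numbers are needed.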