# The Seeds Dataset

Welcome to data science! This text is designed to teach you the practice of data science by working on data sets. It is being written as a series of interactive Jupyter notebooks. If you have the notebooks, you can run the code interactively as you read the text. Throughout, emphasis is placed on practicing data science using the Jupyter notebook server.

The fastest way to install Jupyter is by following the instructions here: http://jupyter.readthedocs.io/en/latest/install.html. That said, I recommend taking the time to install Jupyter using Docker. Installing Jupyter using 

## Interactive Programming

> Interactive computing is a dialog between people and machines.
          — [Beki Grinter](https://beki70.wordpress.com/2011/01/27/what-is-interactive-computing/)


### IPython

IPython is short for interactive Python and is a highly evolved Python
REPL (read-eval-print loop) with a set of tools for interacting with any
and all Python libraries.

**Note** Be careful not to confuse IPython, the command
line REPL, and IPython Notebook, the legacy notebook server that has
evolved into Jupyter.

When an IPython session is terminated all interactions are lost.

### Jupyter

Jupyter is:

- a web-based interactive application
- an interactive code interpreter
- a presentation environment
- a new paradigm in programming
- a way to save complex terminal sessions.

Jupyter is fundamentally changing the way we write code.

Jupyter replaces the [`if __name__ == "__main__":`](http://ibiblio.org/g2swap/byteofpython/read/module-name.html) idiom.

### Jupyter as Persistent Interactive Computing

- Jupyter Notebooks are the evolution of IPython.
- Jupyter allows users to combine live code, markdown and latex-rich text, images, plots, and more in a single document.
- Jupyter is the successor to the IPython notebook. It was renamed as the platform began to support other software kernels, in particular **Ju**lia, **Pyt**hon, and **R**.

Jupyter notebooks are saved as JSON files and at their most basic level are IPython sessions that can be repeatedly run.
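To make the JSON claim concrete, the sketch below builds a minimal notebook-like dictionary by hand and round-trips it through Python's `json` module. The field names follow the version 4 notebook format as I understand it. The cell contents are made up for illustration, and real notebooks carry more metadata than this.

```python
import json

# A hand-written, minimal notebook structure (illustrative content only).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# A title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ['"Hello, World!"'],
        },
    ],
}

# Serialize to the text that would live in an .ipynb file, then parse it back.
as_text = json.dumps(notebook, indent=1)
parsed = json.loads(as_text)

# Collect the source of every code cell: this is what gets re-run.
code_sources = ["".join(cell["source"])
                for cell in parsed["cells"]
                if cell["cell_type"] == "code"]
print(code_sources)
```

Because the file is plain JSON, a notebook can be inspected or processed with ordinary tools even when no Jupyter server is running.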

The output of the last line in a Jupyter cell will be implicitly displayed by the Jupyter Notebook. Try the following (Type the strings as you see them, one per line) in a Jupyter Notebook cell:

In [1]:
"Hello, World!"

'Hello, World!'

Hit `SHIFT+Enter` to execute the cell.

The Jupyter notebook has implicitly rendered the string that appeared on the last line of the cell. To look at this a bit more, we import the `display` function that is being used by the notebook.

In [2]:
from IPython.display import display

Next, we type three strings, each on a different line. Note that only the last string is displayed. Again, the Jupyter notebook has implicitly rendered the string that appeared on the last line of the cell.

In [3]:
"Hello, my baby!"
"Hello, my honey!"
"Hello, my ragtime gal!"

'Hello, my ragtime gal!'

Finally, we explicitly display all of the strings by calling the `display` function ourselves.

In [4]:
display("Hello, my baby!")
display("Hello, my honey!")
display("Hello, my ragtime gal!")

'Hello, my baby!'

'Hello, my honey!'

'Hello, my ragtime gal!'

### How to Program Interactively

Define a variable `my_integer` that is equal to 5.

In [5]:
my_integer = 5

Note that in defining the variable, the value was not actually displayed. To reiterate, the only thing that will be implicitly displayed is the last value in a cell.

In [6]:
my_integer

5

Redefine `my_integer` to be equal to 23.

In [7]:
my_integer = 23

Once more display the result. 

In [9]:
my_integer

23

### Plotting a Strange Attractor

The variable assignments above are not a very interesting use of interactive programming. Plotting a strange attractor makes a better demonstration:

https://en.wikipedia.org/wiki/Attractor#Strange_attractor
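
For reference, the Lorenz system that the code in this section integrates can be written as follows. The state is $(x, y, z)$, and the symbols $\sigma$, $\beta$, and $\rho$ correspond to the `sigma`, `beta`, and `rho` arguments of `create_attractor`:

$$
\begin{aligned}
\frac{dx}{dt} &= \sigma (y - x) \\
\frac{dy}{dt} &= \rho x - x z - y \\
\frac{dz}{dt} &= x y - \beta z
\end{aligned}
$$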

In [None]:
from scipy.integrate import odeint
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D

In [None]:
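def create_attractor(sigma, beta, rho):
    r0 = [0., 1., 0.]                # initial state (x, y, z)
    t = np.linspace(0, 50, 50000)    # time points to integrate over

    # odeint calls the derivative function as func(state, time), so the
    # second positional argument must be the time, not a model parameter.
    def lorenz_oscillator(r, t):
        x, y, z = r
        return [sigma*(y - x), rho*x - x*z - y, x*y - beta*z]

    return odeint(lorenz_oscillator, r0, t)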

In [None]:
fig = plt.figure(figsize=(20,6))
fig.add_subplot(projection='3d')  # create the 3D axes that plt.plot draws on

X = create_attractor(10.0, 8./3., 5.0)
plt.plot(X[:,0],X[:,1], X[:,2], label='lorenz oscillator')
plt.legend()

In [None]:
fig = plt.figure(figsize=(20,6))
fig.add_subplot(projection='3d')  # create the 3D axes that plt.plot draws on

X = create_attractor(10.0, 8./3., 10.0)
plt.plot(X[:,0],X[:,1], X[:,2], label='lorenz oscillator')
plt.legend()

In [None]:
fig = plt.figure(figsize=(20,6))
fig.add_subplot(projection='3d')  # create the 3D axes that plt.plot draws on

X = create_attractor(10.0, 8./3., 15.0)
plt.plot(X[:,0],X[:,1], X[:,2], label='lorenz oscillator')
plt.legend()

In [None]:
fig = plt.figure(figsize=(20,6))
fig.add_subplot(projection='3d')  # create the 3D axes that plt.plot draws on

X = create_attractor(10.0, 8./3., 32.0)
plt.plot(X[:,0],X[:,1], X[:,2], label='lorenz oscillator')
plt.legend()

#### What do you think the `rho` parameter does?

## Python

IPython magics are special commands that can be used to interact with your system.

Enter the following into an IPython session:

In [None]:
(an_integer, 
 a_list,
 a_dictionary,
 a_set,
 a_tuple) = 1, [1,2,3], {'k': 1}, {1,2,2,3}, (1,2)

### Writing a Function

Functions in Python are written using the keyword `def`.

    def function_name(arg1, arg2):
        output_1 = do_something_with(arg1)
        output_2 = do_something_with(arg2)
        return output_1, output_2

In [None]:
from sys import getsizeof

def sizeof_and_value(variable):
    return (getsizeof(variable), variable)

In [None]:
sizeof_and_value(an_integer)

In [None]:
sizeof_and_value(a_tuple)

#### Write a Function

Write a function named `type_and_value` that returns the type and the value, in that order, of a variable that is passed to it.

Use the block below to define the function. 

In [None]:
def type_and_value(var):
    """return the type and value of a variable.
    """
    ### BEGIN SOLUTION
    return type(var), var
    ### END SOLUTION

Make sure that your function can pass this test:

In [None]:
assert type_and_value(an_integer) == (int, 1)

### BEGIN HIDDEN TESTS
assert type_and_value(an_integer) == (int, 1)
assert type_and_value(a_list) == (list,[1,2,3])
assert type_and_value(a_dictionary) == (dict,{'k': 1})
assert type_and_value(a_set) == (set,{1,2,2,3})
assert type_and_value(a_tuple) == (tuple,(1,2))
### END HIDDEN TESTS

### IPython Magic

### `%whos`

Prints a table with some basic details about each identifier you have defined interactively.


In [None]:
%whos 

### Bash in IPython

Bash is a command line language used to interact with an operating system.

Some simple Bash commands can be run in IPython/Jupyter, including

- `pwd`
- `cd`
- `ls`

#### `pwd` - print working directory

In [None]:
%pwd

#### `cd` - change directory

In [None]:
%cd src/

In [None]:
%pwd

#### `ls` - list files

In [None]:
%ls

In [None]:
# HIDDEN TEST

### BEGIN HIDDEN TESTS
import os
assert os.getcwd().split('/')[-1] == 'src'
### END HIDDEN TESTS

### IPython Magic commands

There are many IPython magic commands, but some of the more useful are

- `run`
- `matplotlib inline`
- `whos`


In [None]:
%run a_simple_script.py

In [None]:
%matplotlib inline

In [None]:
%run a_simple_script.py

#### Why didn't the image show up the first time?

## The Python Numerical Stack

Consists of:

- numpy/scipy (vectors and computational mathematics)
- pandas (dataframes)
- matplotlib (plotting)
- seaborn (statistical plotting)
- scikit-learn (machine learning)

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline

**Note** We typically only import what we need from scikit-learn e.g.

In [None]:
from sklearn.linear_model import LinearRegression

#### Data Set Information:

The data set contains measurements of the geometrical properties of wheat kernels belonging to three varieties: Kama, Rosa, and Canadian. Seven real-valued features were recorded for each kernel: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient, and length of kernel groove. The final column, `Class`, identifies the variety.



> The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.
          — [Wikipedia](https://en.wikipedia.org/wiki/Iris_flower_data_set)



In [None]:
seeds_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt",
                      header=None, sep=r"\s+")  # raw string: split on runs of whitespace
seeds_data.columns = [
"area",
"perimeter",
"compactness",
"length of kernel",
"width of kernel",
"asymmetry coefficient",
"length of kernel groove",
"Class"
]

In [None]:
seeds_data.shape

#### What does `.shape` do?

### Dataframes

We have loaded the seeds data into a dataframe for ease of manipulation.

In [None]:
seeds_data.head()


### Pair Plot

We will use Seaborn to prepare a **Pair Plot** of the seeds dataset. A Pair Plot is an array of scatter plots, one for each pair of features in the data. Rather than plotting a feature against itself, the diagonal is rendered as the **distribution** of the given feature (a histogram by default).


In [None]:
sns.pairplot(seeds_data)


### List Comprehension

We will use a **list comprehension** to remove the units and white space from the feature names to make them more "computer-friendly".


In general, list comprehensions have this form:

```python
lc = [do_something_to(var) for var in some_other_list]
```
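
As a small illustration of that general form, the sketch below writes the same transformation twice: once as an explicit loop and once as a list comprehension. The helper `double` and the input list are hypothetical and exist only for this example.

```python
def double(value):
    """Hypothetical helper used only for this illustration."""
    return value * 2

numbers = [1, 2, 3]

# Explicit loop: build the list one append at a time.
doubled_loop = []
for n in numbers:
    doubled_loop.append(double(n))

# Equivalent list comprehension: same result in a single expression.
doubled_lc = [double(n) for n in numbers]

print(doubled_loop, doubled_lc)
```

Both forms produce the same list; the comprehension simply states the intent more compactly.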


In [None]:
def square_number(x):
    return x**2

In [None]:
[square_number(i) for i in (1,2,3,4,5)]

#### Write your own list comprehension

Write a function that uses a list comprehension to change this list

    [1,2,3,4,5]
    
into this list

    [2,3,4,5,6]

In [None]:
def incr_list_by_1(lst):
    """returns a list where each value in the list has been incremented by one"""
    
    ### BEGIN SOLUTION
    return [i+1 for i in lst]
    ### END SOLUTION

In [None]:
assert incr_list_by_1([1,2,3,4,5]) == [2,3,4,5,6]

### BEGIN HIDDEN TESTS
assert incr_list_by_1([1,2,3,4,5,1,2,3,4,5]) == [2,3,4,5,6,2,3,4,5,6]
### END HIDDEN TESTS

### Remove Unit and White Space from Feature Name

Here we use a list comprehension to change the feature names:

In [None]:
seeds_data.columns

In [None]:
def remove_unit_and_white_space(feature_name):
    feature_name = feature_name.replace(' (cm)','')
    feature_name = feature_name.replace(' ', '_')
    return feature_name

In [None]:
seeds_data_features_names = [remove_unit_and_white_space(name) for name in seeds_data.columns]

In [None]:
seeds_data_features_names

In [None]:
seeds_data.columns = seeds_data_features_names
seeds_data.head()

### Export to CSV

Ultimately, we will export a CSV of the dataframe to disk. This will make it easy to access the same data from both Python and R.
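
One detail worth knowing before exporting: by default `to_csv` also writes the dataframe's index as an unnamed first column, which is why R's `read.csv` is later called with `row.names=1`. The sketch below shows the effect on a tiny stand-in dataframe with made-up values, writing to an in-memory buffer rather than to disk.

```python
import io
import pandas as pd

# A tiny stand-in dataframe (made-up values, not the seeds data).
df = pd.DataFrame({"area": [15.0, 14.5], "perimeter": [14.8, 14.2]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Reading naively turns the saved index into an extra, unnamed column ...
naive = pd.read_csv(io.StringIO(csv_text))
# ... while index_col=0 restores the original layout.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(naive.columns), list(restored.columns))
```

If the index carries no information, passing `index=False` to `to_csv` is an alternative, but then the reader must not treat the first column as row names.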


In [None]:
%ls

In [None]:
%mkdir -p data

In [None]:
%ls

In [None]:
seeds_data.to_csv('data/seeds_data.csv')

## Prediction

### Why estimate $f$?

We can think of a given dataset upon which we are working as a representation of some actual phenomenon. We can imagine there to be some sort of "universal" function, $f$, that was used to generate the data, one that we can never truly know.

As data scientists, we will seek to estimate this function. We will call our estimate $\hat{f}$ ("eff hat").

There are two main reasons we might want to estimate $f$ with $\hat{f}$:

- prediction
   - given some set of known inputs and known outputs, we may wish to create some function that can take a new set of inputs and predict what the output would be for these inputs
- inference
   - given some set of known inputs and (optionally) known outputs, we may wish to understand how the inputs (and outputs) interact with each other
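
One common way to write this down, offered here as an assumption rather than something the text above commits to, is to model an observed output $Y$ as the unknown function applied to the inputs $X$ plus some noise $\epsilon$, and to form predictions $\hat{Y}$ from the estimate:

$$
Y = f(X) + \epsilon, \qquad \hat{Y} = \hat{f}(X)
$$

In these terms, prediction cares mainly about how close $\hat{Y}$ is to $Y$, while inference cares about the form of $\hat{f}$ itself.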

In [None]:
%pwd

#### What does `pwd` tell us? What does this mean in the context of a Jupyter Notebook? Why would it be important to think about this before we load a csv file?

In [None]:
%ls

In [None]:
seeds_data.describe()

In [None]:
plt.figure(1, (20,10))

sns.pairplot(seeds_data)

Having a look at the pair plot, we might say that we are able to predict the perimeter of a kernel from its area.

In [None]:
plt.figure(1, (20,5))

sns.regplot(x='area', y='perimeter', data=seeds_data)  # x and y passed by keyword

### Linear Regression

We might build a **simple regression model** to do this for us using scikit-learn. Here, the **input variable** would be `area` and the **output variable** would be `perimeter`.

We will usually refer to our input variable(s) as **feature(s)** and our output variable as the **target**.
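
To show what fitting a simple regression amounts to, here is a minimal sketch of the closed-form least-squares solution for one feature and one target. The function name `fit_simple_regression` and the data values are hypothetical, chosen so the answer is easy to check by hand. This is not how scikit-learn is implemented, only the same idea at small scale.

```python
def fit_simple_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up feature/target values lying exactly on target = 1 + 2 * feature.
feature = [0.0, 1.0, 2.0, 3.0]
target = [1.0, 3.0, 5.0, 7.0]

intercept, slope = fit_simple_regression(feature, target)
print(intercept, slope)
```

With noisy real data the points will not lie exactly on a line, and the fitted intercept and slope are the values that minimize the sum of squared vertical distances.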

### Build a Simple Regression Model

In [None]:
from patsy import dmatrices

target, features = dmatrices("perimeter ~ area", seeds_data)

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
linear_regression_model = LinearRegression()
linear_regression_model.fit(features, target)

### Plot the Results

Having prepared the regression model, we use it to make predictions.

We then plot the predictions versus the actual values.


In [None]:
plt.figure(1, (20,5))

sns.regplot(x='area', y='perimeter', data=seeds_data)  # x and y passed by keyword

predictions = linear_regression_model.predict(features)
plt.scatter(seeds_data.area, predictions, marker='x', color='red')

#### What does this plot show us? 

## The Train-Test Split

What if we wish to know how well the perimeter can be predicted for unseen data?

![](doc/img/ttspl.png)

### Overfitting and Underfitting

When fitting a model for making predictions, a model is only as good as its ability to work on unseen data. A model that does not learn the underlying patterns in the data is said to be **underfit**. A model that learns the patterns in its training data too well, noise included, is said to be **overfit**.

### Learning Too Well is a Problem!?

It may seem odd to think of a model that has learned too well as being bad in some way, but recall that we are looking to make predictions with new input data. A model that is overfit will have learned the patterns in its **training** data, but will also have learned the noise inherent to this data. New input data will have completely different noise *by definition*. A model that is overfit will be poor at generalization and will not perform well on data it has never seen.
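
The contrast can be sketched with three toy models on made-up data (a straight line plus Gaussian noise, not the seeds data). The constant model ignores the input and stands in for underfitting. The memorizing model returns the target of the nearest training point and stands in for overfitting. All names here are hypothetical, and the exact numbers depend on the random seed.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def make_data(n):
    """Made-up data: a straight line plus Gaussian noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + random.gauss(0, 2) for x in xs]
    return xs, ys

def mse(model, xs, ys):
    """Mean squared error of a model over paired inputs and targets."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

x_train, y_train = make_data(30)
x_test, y_test = make_data(30)

# Underfit: ignore the input and always predict the training mean.
mean_y = sum(y_train) / len(y_train)
def constant_model(x):
    return mean_y

# Reasonable fit: least-squares straight line through the training data.
mean_x = sum(x_train) / len(x_train)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(x_train, y_train))
         / sum((x - mean_x) ** 2 for x in x_train))
def linear_model(x):
    return mean_y + slope * (x - mean_x)

# Overfit: memorize the training set and return the nearest stored target.
def memorizing_model(x):
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
    return y_train[nearest]

for name, model in [("constant", constant_model),
                    ("linear", linear_model),
                    ("memorizer", memorizing_model)]:
    print(name, mse(model, x_train, y_train), mse(model, x_test, y_test))
```

The memorizer reproduces its training targets exactly, so its training error is zero, yet that says nothing about how it handles new points. Comparing the two printed columns for each model is the habit the rest of this section builds on.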

### The Train-Test Split

Of course, we will not have access to the new data we will use at the time of fitting the model. We will have to simulate new data in some way. We do this by creating **test** data using some fraction of the original data we started with.
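
Conceptually, the split is a shuffle followed by a cut. The sketch below is a hypothetical helper written only to illustrate the idea; it is not scikit-learn's implementation. The 25% test fraction is an assumption chosen to mirror what I believe is scikit-learn's default.

```python
import random

def simple_train_test_split(rows, test_fraction=0.25, seed=42):
    """Hypothetical helper: shuffle row indices and hold out a fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)   # seeded for repeatability
    n_test = int(round(len(rows) * test_fraction))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(20))  # stand-in for 20 observations
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Every observation ends up in exactly one of the two sets, and fixing the seed makes the split reproducible from run to run.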


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
target, features = dmatrices("perimeter ~ area", seeds_data)

In [None]:
(features_train,
 features_test,
 target_train,
 target_test) = train_test_split(features, target, random_state=42) 

In [None]:
(features_train.shape,
 target_train.shape,
 features_test.shape,
 target_test.shape)

In [None]:
features_test[:5]

In [None]:
linear_regression_model = LinearRegression(fit_intercept=False)

linear_regression_model.fit(features_train, target_train)

petal_width_prediction_1_var = (linear_regression_model
                                .predict(features_test))

## Inference

### Why estimate $f$? 

Note that, after one final Python plot of the test predictions, the remaining cells in this section are executed using R.

There are two main reasons we might want to estimate $f$ with $\hat{f}$:

- prediction
   - given some set of known inputs and known outputs, we may wish to create some function that can take a new set of inputs and predict what the output would be for these inputs
- inference
   - given some set of known inputs and (optionally) known outputs, we may wish to understand how the inputs (and outputs) interact with each other


In [None]:
plt.figure(1, (20,5))

plt.scatter(features_test[:, 1], target_test, 
            marker='o', color='blue', alpha=0.5, 
            label='actual test values')
plt.scatter(features_test[:, 1], petal_width_prediction_1_var,
            marker='x', color='red', alpha=0.5, 
            label='predicted test values - 1 variable')
plt.legend()

#### Explain why we use the train-test split in the context of overfitting and underfitting.

In [None]:
seeds.data = read.csv('data/seeds_data.csv', row.names=1)

#### Sanity Check

In [None]:
head(seeds.data)

In [None]:
summary(seeds.data)

In [None]:
library(repr)

In [None]:
options(repr.plot.width=20, repr.plot.height=10)

In [None]:
pairs(seeds.data)

In [None]:
library(ggplot2)

In [None]:
options(repr.plot.width=20, repr.plot.height=5)

ggplot(seeds.data, aes(length_of_kernel, length_of_kernel_groove)) +  # underscores, as written in the exported CSV header
  geom_point() + 
  geom_smooth(method='lm')

### Build a Simple Regression Model

Armed with this information we might say that we are able to predict the length of the kernel groove if we know the length of the kernel. We might build a **simple regression model** to do this for us using R's `lm` function. Here, the **input variable** would be the length of the kernel and the **output variable** would be the length of the kernel groove.

We will usually refer to our input variable(s) as **feature(s)** and our output variable as the **target**.

In [None]:
lm_1_var = lm(length_of_kernel_groove ~ length_of_kernel, data = seeds.data)
lm_1_var

In [None]:
lm(length_of_kernel_groove ~ ., data = seeds.data)

#### What can be inferred from the coefficients?

- Which predictors are associated with the response?
- What is the relationship between the response and each predictor?
- Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?