# Lab 01 - Introduction

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
%matplotlib inline
sns.set_style("darkgrid")

import sys
sys.path.append('../')
from lib.processing_functions import convert_to_pandas

## Exercise goals:

- Familiarize yourself with the scikit-learn online documentation
- Explore some of the datasets that will be used

---
## Exercise 1: Scikit-learn online documentation

### 1.1 Display the estimator cheat-sheet

Find the estimator cheat-sheet in the online documentation. Copy the image URL and use it in the cell below to display the cheat-sheet in this notebook:

(In case this doesn't work: just keep it open in another tab in your browser.)

```python
# TODO: Replace <FILL IN> with appropriate code
from IPython.display import Image
url = <FILL IN>
Image(url)
```

In [None]:
%load ../answers/01_01_estimator.py

### 1.2 Analyze the cheat-sheet 

- What are the four main classes of algorithms found on the cheat-sheet?
- If one wants to do classification on a moderately sized text dataset, what does the cheat-sheet advise?
- Use the scikit-learn website to figure out how the statement "few features should be important" on the cheat-sheet relates to the Lasso algorithm.
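To make the last question concrete, here is a minimal sketch of how an L1-penalized linear model behaves when only a few features matter. The data is synthetic and the names `X_toy`, `y_toy` and the value `alpha=0.1` are assumptions chosen for illustration; they are not part of the lab material.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for illustration): 10 features,
# of which only the first two actually influence the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# alpha sets the strength of the L1 penalty; 0.1 is an arbitrary choice.
lasso = Lasso(alpha=0.1).fit(X_toy, y_toy)

# Count how many coefficients the penalty left different from zero.
n_nonzero = np.count_nonzero(lasso.coef_)
print("non-zero coefficients:", n_nonzero, "out of", lasso.coef_.size)
print(np.round(lasso.coef_, 2))
```

Comparing the printed coefficients with the two features used to build `y_toy` is a useful starting point for answering the question in your own words.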

### 1.3 Scikit-learn documentation

Click on the following link http://scikit-learn.org/stable/documentation.html and answer the questions below:

- What are the 7 topics of the scikit-learn user guide?
- Go to '1.1. Generalized Linear Models' of the user guide and make a list of the algorithms described there with which you are not yet familiar.
- What does the user guide state as the use-case for K-Means clustering in '2.3. Clustering'? 
- Use the API documentation to list three of the metrics present in the `sklearn.metrics` module. What is the required input for the `f1_score` function?

---
## Exercise 2: Boston dataset

### 2.1 Dataset description:

Consider the description of this dataset printed below:

In [None]:
boston = datasets.load_boston()
print(boston.DESCR)

**Question**: For what type of learning problem is this dataset useful (e.g. regression, classification or clustering)?

### 2.2 Descriptive statistics

Use pandas to show the descriptive statistics of this dataset:

```python
# TODO: Replace <FILL IN> with appropriate code
# load the boston dataset
X, y = convert_to_pandas(boston)

# join features and targets
all_data = X.join(y)

# Show descriptive statistics
all_data.<FILL IN>
```

In [None]:
%load ../answers/01_02_descriptive.py

**Question**: Which feature has the largest spread and which one the smallest spread? Is this useful information to know?

### 2.3 Correlation 

Use pandas to show the pairwise correlation of the target and all the features:

```python
# TODO: Replace <FILL IN> with appropriate code
# compute all pair wise correlations
pairwise_corr = all_data.<FILL IN>

# display result:
pairwise_corr['target'].plot(kind='bar');
```

In [None]:
%load ../answers/01_03_correlation.py

**Question**: Which feature has the largest (absolute) correlation with the target? Describe the correlation you found.
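When describing a correlation, both its sign and its magnitude matter. The sketch below uses made-up data (the variables `x`, `y_pos` and `y_neg` are hypothetical, not taken from the Boston dataset) to show a strong positive and a weaker negative Pearson correlation computed with `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)

# Strong positive relation: little noise relative to the signal.
y_pos = 2.0 * x + rng.normal(scale=0.5, size=500)
# Weaker negative relation: the noise dominates the signal.
y_neg = -x + rng.normal(scale=2.0, size=500)

# np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal entry.
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]

print("r(x, y_pos) = {:.2f}".format(r_pos))
print("r(x, y_neg) = {:.2f}".format(r_neg))
```

A value close to +1 or -1 indicates a nearly linear relation, while the sign tells whether the two quantities move together or in opposite directions.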

---
## Exercise 3: Digits dataset

Import the digits dataset:

In [None]:
# load the digits dataset
digits = datasets.load_digits()
X, y = convert_to_pandas(digits)

### 3.1 Display a digit

Each feature in the dataset represents a pixel value of an 8x8 gray-scale image of a digit.
The images are flattened into 64 features and each image is a row in `X`.
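As a small sketch of what this flattening means, the snippet below reshapes a stand-in vector of 64 values (not real digit data) back into an 8x8 grid and looks up where one flat index ends up. It assumes the usual row-major layout, in which the first eight values form the top row of the image.

```python
import numpy as np

# Stand-in for one flattened image: the values 0..63 instead of pixel data.
flat = np.arange(64)

# Row-major reshape: the first 8 values become the top row of the image.
image = flat.reshape(8, 8)

# Convert a flat feature index into its (row, column) position.
row, col = np.unravel_index(10, (8, 8))
print(row, col)                       # 1 2
print(image[row, col] == flat[10])    # True
```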

Reshape `feature_vector` (a single image of `X`) as an 8x8 image matrix using `.values.reshape()` and use the `show_digit` function to display the result:

In [None]:
def show_digit(image, label=None, color='green', ax=None):
    """Shows a single digit.
    
    Parameters:
    -----------
    image : numpy 2D array
        Image to show.
    label : str, default None 
        Text to show on figure.
    color : str, default 'green'
        Color of text.
    ax : matplotlib axes
        Axis to plot on.
    """
    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=(2,2))
    ax.imshow(image, cmap='binary', interpolation='nearest')
    ax.set_axis_off()
    if label is not None:
        ax.text(0, 0, str(label), transform=ax.transAxes, color=color, 
                fontsize=16)

```python
# TODO: Replace <FILL IN> with appropriate code
# select one feature vector
feature_vector = X.iloc[65]

# convert the feature vector into an image matrix
image_matrix = feature_vector.<FILL IN>

# display the digit
show_digit(image_matrix)
```

In [None]:
%load ../answers/01_04_display_digit.py

### 3.2 Display 20 digits and their labels

The function `display_digits` below shows multiple digits with their labels. Use the function to display 20 digits and their labels:

In [None]:
# TODO: Replace <FILL IN> with appropriate code
def display_digits(X, y, y_pred=None, n_max=20):
    """Display multiple digits.
    
    Parameters
    ----------
    X : numpy ndarray or pandas DataFrame
        Image data (size n_images x n_pixels).
    y : pandas Series
        True labels (length n_images); accessed with `.iloc`.
    y_pred : iterable
        Predicted labels (size n_images x 1).
    n_max : int
        Maximum images to plot.
    """
    n = n_max if len(X) > n_max else len(X)
    if not isinstance(X, pd.DataFrame):
        X = pd.DataFrame(X[:n, :])
    if not isinstance(y_pred, pd.Series) and y_pred is not None:
        y_pred = pd.Series(y_pred)
    ncol = 10
    nrow = (n - 1) // ncol + 1
    fig,axes = plt.subplots(nrow, ncol, figsize=(ncol, nrow))
    fig.subplots_adjust(hspace=0.1, wspace=0.1)
    for i,ax in enumerate(axes.flat):
        if i<n:
            image = X.iloc[i].values.reshape(8,8)
            label = y.iloc[i] if y_pred is None else y_pred.iloc[i]
            color = 'green' if y.iloc[i] == label else 'red'
            show_digit(image, label, color, ax=ax)
        else:
            ax.axis('off')
    return fig

```python
# display 20 digits        
fig = <FILL IN>
```

In [None]:
%load ../answers/01_05_display_20.py

### 3.3 Target analysis

When doing classification it is important to know whether the dataset is balanced, in other words whether the number of samples per class is approximately equal.
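One rough way to quantify balance is the ratio between the smallest and the largest class count. The sketch below applies this to a handful of made-up labels; the `labels` array and the ratio itself are illustrative assumptions, not a standard scikit-learn metric.

```python
import numpy as np

# Made-up integer class labels (not the digits targets).
labels = np.array([0, 1, 1, 2, 2, 2, 0, 1, 2, 2])

# np.bincount counts the occurrences of each non-negative integer label.
counts = np.bincount(labels)
print(counts)  # [2 3 5]

# A ratio of 1.0 means perfectly balanced; values near 0 mean strong imbalance.
ratio = counts.min() / counts.max()
print("min/max class ratio: {:.2f}".format(ratio))  # 0.40
```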

Plot the value counts of the different labels in the data set:

```python
# TODO: Replace <FILL IN> with appropriate code
# use the value counts method to count the labels 
label_counts = y.<FILL IN>

# compute the average label count of all classes
avg_count = label_counts.mean()
print("avg label count: {}".format(avg_count))

# display the count for each label
label_counts.sort_index().plot(kind='bar',title='target label counts');
```

In [None]:
%load ../answers/01_06_target.py

**Question**: Is the digits dataset balanced?

### 3.4 Feature analysis

To learn a bit more about the features in the digits data we will look at the histogram of a feature's values over all samples.

Complete the cell below to show a histogram of the values of feature 35:

```python
# TODO: Replace <FILL IN> with appropriate code
# specify which feature column to select
sel_feature = 'feature_35'
# plot histogram
ax = X[sel_feature].<FILL IN>
ax.set_title('Value histogram {}'.format(sel_feature));
```

In [None]:
%load ../answers/01_07_feature.py

**Question**: What does the histogram tell us about the value distribution of feature 35? What is the location of this feature in the digit image?

Run the cell below to plot value histograms for each feature in the dataset. Note that the histogram's grid location corresponds to the pixel location in the 8 by 8 digit image.  

In [None]:
# note: the `size` argument of FacetGrid was renamed to `height` in seaborn 0.9
g = sns.FacetGrid(pd.melt(X), col='features', col_wrap=8, height=1., aspect=1.1)
bins = np.linspace(0, 16, 10)
g.map(plt.hist, 'value' ,color="steelblue", bins=bins)
for ax in g.axes.flat:
    ax.set_title('')
    ax.set_aspect('auto')
    ax.set_yticks([0,len(X)])
    ax.set_xticks([0,16])
g.fig.tight_layout(pad=0.4)
g.fig.suptitle('feature value histograms', fontsize=16, y=1.02);

**Question**: Based on this plot, which pixels do you think are the most useful for classifying the digits? And which ones will probably be less useful?

In [None]:
%load ../answers/01_questions.py