# Getting started with Yellowbrick's Regression Visualizations with StatsModels Wrapper


This is a quick tutorial on getting started with [Yellowbrick's StatsModels Wrapper](http://www.scikit-yb.org/en/develop/api/contrib/statsmodels.html?highlight=statsmodels), for those who like using StatsModels and want to explore Yellowbrick, or for those who would like to explore both. Because Yellowbrick is built around the Scikit-Learn API, this wrapper lets users apply Yellowbrick visualizers to StatsModels models.


#### **Just a few things before getting started...**

**What is [Yellowbrick](http://www.scikit-yb.org/en/latest/)**?
>*Yellowbrick* is a suite of visual diagnostic tools called “Visualizers” that extend the Scikit-Learn API to allow human steering of the model selection process. In a nutshell, Yellowbrick combines Scikit-Learn with Matplotlib in the best tradition of the Scikit-Learn documentation, but to produce visualizations for your models! 

**How to get started with Yellowbrick?**
>Visit Yellowbrick's [Quick Start](http://www.scikit-yb.org/en/latest/quickstart.html) page for installation instructions, API information, and walkthrough examples.

**Yellowbrick Visualizers**
> Visualizers are estimators (objects that learn from data) whose primary objective is to create visualizations that allow insight into the model selection process. In Scikit-Learn terms, they can be similar to transformers when visualizing the data space, or they can wrap a model estimator similar to how the “ModelCV” (e.g. RidgeCV, LassoCV) methods work. The primary goal of Yellowbrick is to create a sensible API similar to Scikit-Learn. Some of our most popular visualizers include:

>**Feature Visualization**
* Rank Features: pairwise ranking of features to detect relationships
* Parallel Coordinates: horizontal visualization of instances
* Radial Visualization: separation of instances around a circular plot
* PCA Projection: projection of instances based on principal components
* Feature Importances: rank features by importance or linear coefficients for a specific model
* Scatter and Joint Plots: direct data visualization with feature selection  

>**Classification Visualization**
* Class Balance: see how the distribution of classes affects the model
* Class Prediction Error: shows error and support in classification
* Classification Report: visual representation of precision, recall, and F1
* ROC/AUC Curves: receiver operator characteristics and area under the curve
* Confusion Matrices: visual description of class decision making  

>**Regression Visualization**
* Prediction Error Plot: find model breakdowns along the domain of the target
* Residuals Plot: show the difference in residuals of training and test data
* Alpha Selection: show how the choice of alpha influences regularization  

>**Clustering Visualization**
* K-Elbow Plot: select k using the elbow method and various metrics
* Silhouette Plot: select k by visualizing silhouette coefficient values 

> **Text Visualization**
* Term Frequency: visualize the frequency distribution of terms in the corpus
* t-SNE Corpus Visualization: use stochastic neighbor embedding to project documents.


>… and more! Visualizers are being added all the time; be sure to check the examples (or even the develop branch) and feel free to contribute your ideas for new Visualizers!

**What is the StatsModelsWrapper?**
> It is a basic wrapper for statsmodels that emulates a scikit-learn estimator. *It wraps a statsmodels GLM as a (fake) sklearn BaseEstimator for Yellowbrick.*



**This tutorial...**
- uses the concrete dataset from Yellowbrick's Example Datasets, which are from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/),
- shows some of Yellowbrick's feature visualizations and regression visualizations,
- shows how to get started with the StatsModels Wrapper from Yellowbrick.

### Let's import Yellowbrick and load the concrete dataset

Below is from the Example Datasets [documentation](http://www.scikit-yb.org/en/develop/api/datasets.html).  

First, the data must be downloaded. For easy access, download it into your current directory from the terminal.


```bash
    $ python -m yellowbrick.download 
```

This creates a directory named "data" containing all the example datasets. For this tutorial, the concrete dataset is then loaded with pandas.

**For the purpose of this tutorial, the target is "strength" of concrete.**  
Check out the [Concrete Data](https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength) from the UCI Machine Learning Repository. 

In [None]:
import yellowbrick

In [None]:
import pandas as pd
data = pd.read_csv('data/concrete/concrete.csv')

## Instantiating to Yellowbrick Visualizers

Yellowbrick visualizers follow the Scikit-Learn workflow: a `Visualizer` is fit to the data (or wraps a model and fits it), much like a transformer or estimator. Knowing this pattern makes the visualizers, and in turn the StatsModels Wrapper, easier to use.

The workflow goes:
- import and instantiate the visualizer
- call the `fit()` method to fit the visualizer to the data
- call the `poof()` method to render the image (this method was renamed `show()` in Yellowbrick 1.0 and later)

Such as:
```python
from yellowbrick.features import ParallelCoordinates
visualizer = ParallelCoordinates()
visualizer.fit_transform(X, y)
visualizer.poof()
```



## Intro to Rank Features and Joint Plot Visualizations 
(These are just two of Yellowbrick's visualizations; there are many more!)


### Rank Features

These [rank feature visualizations](http://www.scikit-yb.org/en/develop/api/features/rankd.html#rank-features) evaluate a single feature or a feature's relationship to another feature. 

#### *First*, **Rank1D**

>This is a one-dimensional ranking of features. The default algorithm for Rank1D is Shapiro-Wilk, which assesses how close the distribution of instances is to normal for each feature. The result is a bar plot with one bar per feature.

To start, load the features and target data.

In [None]:
features = ['cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age']
target = 'strength'

X = data[features]
y = data[target]

Let's see **Rank1D**!

In [None]:
from yellowbrick.features import Rank1D

visualizer = Rank1D(features=features, algorithm='shapiro')  # Instantiate the visualizer

visualizer.fit(X, y)                # Fit the data to the visualizer
visualizer.transform(X)             # Transform the data
visualizer.poof()                   # Draw/show/poof the data
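
To make the score behind each bar more tangible, here is a minimal sketch that computes a Shapiro-Wilk statistic per column directly. It assumes SciPy and NumPy are installed and uses synthetic columns (the names `bell_shaped` and `skewed` are made up for illustration), so it is not a reproduction of what Rank1D does internally.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)

# Synthetic stand-ins for two feature columns (not the concrete data)
columns = {
    "bell_shaped": rng.normal(loc=0.0, scale=1.0, size=500),
    "skewed": rng.exponential(scale=1.0, size=500),
}

# One Shapiro-Wilk W statistic per column; values near 1 suggest normality
scores = {name: shapiro(values)[0] for name, values in columns.items()}

for name, w in scores.items():
    print(f"{name}: W = {w:.3f}")
```

The normally distributed column should score noticeably closer to 1 than the skewed one, which is the kind of contrast the Rank1D bars make visible.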

#### *Then second*, **Rank2D**

>This is a two-dimensional ranking of features, which compares each feature with every other feature and draws a heatmap to represent the relationships. The default algorithm of Rank2D is Pearson correlation; covariance is also available.



To start, load the features and target data.

In [None]:
features = ['cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age']
target = 'strength'

X = data[features]
y = data[target]

Let's see **Rank2D**!

In [None]:
from yellowbrick.features import Rank2D

visualizer = Rank2D(features=features, algorithm='pearson')  # Instantiate the visualizer

visualizer.fit(X, y)                # Fit the data to the visualizer
visualizer.transform(X)             # Transform the data
visualizer.poof()                   # Draw/show/poof the data
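
The heatmap is essentially a picture of a pairwise correlation matrix. As a rough sketch of the numbers behind such a plot, the snippet below builds a Pearson correlation matrix with NumPy on synthetic data. The array `X_demo` is an assumption made for illustration and has nothing to do with the concrete dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: column 1 tracks column 0, column 2 is independent noise
a = rng.normal(size=200)
X_demo = np.column_stack([
    a,
    2.0 * a + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# rowvar=False treats each column as a variable, giving a 3 x 3 matrix
corr = np.corrcoef(X_demo, rowvar=False)
print(np.round(corr, 2))
```

The diagonal is always 1 (each feature against itself), the matrix is symmetric, and strongly related pairs such as columns 0 and 1 show up as entries near 1.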

### Joint Plot Visualization

>The [joint plot visualization](http://www.scikit-yb.org/en/develop/api/features/scatter.html#joint-plot-visualization) will show the distribution of a feature against the target.


To start, load the features and target data.

In [None]:
feature = 'water'
target = 'strength'

X = data[feature]
y = data[target]

Let's see the **Joint Plot Visualizer**!

In [None]:
from yellowbrick.features import JointPlotVisualizer

visualizer = JointPlotVisualizer(feature=feature, target=target)  # Instantiate the visualizer

visualizer.fit(X, y)                # Fit the data to the visualizer
visualizer.poof()                   # Draw/show/poof the data

## Accessing the StatsModelsWrapper

>The purpose of this basic [statsmodels wrapper](http://www.scikit-yb.org/en/develop/api/contrib/statsmodels.html) is to be able to emulate the scikit-learn estimator and be able to utilize Yellowbrick visualizations.



All that needs to be done is to import the external libraries and instantiate the **statsmodels wrapper** with a GLM.

In [None]:
import statsmodels.api as sm 
from functools import partial
from yellowbrick.contrib.statsmodels import StatsModelsWrapper

glm_gaussian_partial = partial(sm.GLM, family=sm.families.Gaussian())  # pre-fill the GLM family; the data is supplied later
model = StatsModelsWrapper(glm_gaussian_partial)  # wrap the statsmodels GLM so it behaves like a sklearn estimator

**And that's it! Now the GLM from statsmodels is ready to be used with regression visualizers!**
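
To make the mechanics concrete, here is a minimal sketch of what a wrapper of this kind might do. It is not Yellowbrick's actual source: `SketchWrapper`, `MeanModel`, and `MeanResults` are hypothetical names, and `MeanModel` is a toy stand-in that only mimics the statsmodels calling convention (construct with the target first and the features second, then call `fit()` to get a results object with `predict()`).

```python
from functools import partial


class MeanResults:
    """Toy results object: always predicts the training mean."""

    def __init__(self, mean):
        self.mean = mean

    def predict(self, X):
        return [self.mean for _ in X]


class MeanModel:
    """Hypothetical stand-in for a statsmodels model: (endog, exog, **options)."""

    def __init__(self, endog, exog, family="gaussian"):
        self.endog, self.exog, self.family = endog, exog, family

    def fit(self):
        return MeanResults(sum(self.endog) / len(self.endog))


class SketchWrapper:
    """Hypothetical wrapper exposing scikit-learn style fit/predict/score."""

    def __init__(self, model_partial):
        self.model_partial = model_partial

    def fit(self, X, y):
        # statsmodels-style constructors take the target first, then the features
        self.results_ = self.model_partial(y, X).fit()
        return self

    def predict(self, X):
        return self.results_.predict(X)

    def score(self, X, y):
        # Coefficient of determination (R^2) of the predictions
        y_pred = self.predict(X)
        y_mean = sum(y) / len(y)
        ss_res = sum((a - b) ** 2 for a, b in zip(y, y_pred))
        ss_tot = sum((a - y_mean) ** 2 for a in y)
        return 1 - ss_res / ss_tot


# partial() pre-fills keyword arguments, so the wrapper only has to supply (y, X)
model_partial = partial(MeanModel, family="gaussian")
wrapper = SketchWrapper(model_partial).fit([[1], [2], [3]], [2.0, 4.0, 6.0])
predictions = wrapper.predict([[4], [5]])
r2 = wrapper.score([[1], [2], [3]], [2.0, 4.0, 6.0])
print(predictions, r2)
```

The point of the `partial` is that model options such as the family are fixed up front, while the data arrives only when a visualizer calls `fit(X, y)`. Anything that offers `fit`, `predict`, and `score` in this shape can be handed to a Yellowbrick regression visualizer.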



## Regression Visualizers

### Residuals Plot

> The [residuals plot](http://www.scikit-yb.org/en/latest/api/regressor/residuals.html) shows the residuals against the predicted values. If the points are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.

To start, load the features and target data, then create train and test sets.


In [None]:
from sklearn.model_selection import train_test_split

features = ['cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age']
target = 'strength'

X = data[features]
y = data[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Instantiate the model (here, the statsmodels GLM wrapped up as a sklearn BaseEstimator) and the visualizer.

In [None]:
from yellowbrick.regressor import ResidualsPlot

visualizer = ResidualsPlot(model)

Let's see the **Residuals Plot**!


In [None]:
visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
g = visualizer.poof()             # Draw/show/poof the data
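
A residual is simply the observed value minus the predicted value. As a small illustration of why the shape of the residuals matters, the sketch below fits a straight line to two synthetic datasets with NumPy, one truly linear and one curved, and checks whether the residuals still carry a pattern. The data and the helper `line_residuals` are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 200)


def line_residuals(x, y):
    """Fit y ~ slope * x + intercept by least squares; return observed minus predicted."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)


y_linear = 3.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)
y_curved = 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)

res_linear = line_residuals(x, y_linear)
res_curved = line_residuals(x, y_curved)

# A leftover U shape shows up as correlation between the residuals
# and the squared distance of x from its mean
curvature = (x - x.mean()) ** 2
corr_linear = np.corrcoef(curvature, res_linear)[0, 1]
corr_curved = np.corrcoef(curvature, res_curved)[0, 1]
print(corr_linear, corr_curved)
```

For the linear data the residuals look like noise, so the correlation is small. For the curved data the straight line leaves a clear U-shaped pattern, which is the visual cue that a non-linear model would be a better fit.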

### Prediction Error Plot

>The [prediction error plot](http://www.scikit-yb.org/en/latest/api/regressor/peplot.html) shows the actual target values against the predicted values generated by the model we selected (here, the statsmodels GLM).

To start, load the features and target data, then create train and test sets.


In [None]:
from sklearn.model_selection import train_test_split

features = ['cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age']
target = 'strength'

X = data[features]
y = data[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Instantiate the model (here, the statsmodels GLM wrapped up as a sklearn BaseEstimator) and the visualizer.

In [None]:
from yellowbrick.regressor import PredictionError

visualizer = PredictionError(model)

Let's see the **Prediction Error Plot**!


In [None]:
visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
g = visualizer.poof()             # Draw/show/poof the data
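
The score reported alongside a plot like this is, for regressors, usually the coefficient of determination R². Assuming that is the metric in use, here is a minimal hand-rolled version on made-up numbers, showing how the distance between actual and predicted values turns into a single score. The function `r_squared` and the lists below are illustrative only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up strengths, not taken from the concrete dataset
actual = [10.0, 20.0, 30.0, 40.0]
perfect = [10.0, 20.0, 30.0, 40.0]
noisy = [12.0, 18.0, 33.0, 37.0]

r2_perfect = r_squared(actual, perfect)  # every point sits on the identity line
r2_noisy = r_squared(actual, noisy)      # points scatter around the identity line
print(r2_perfect, r2_noisy)
```

Predictions that land exactly on the 45-degree line give a score of 1, and the score drops as the points drift away from that line.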

###  And that's how to get started with Yellowbrick's regression visualizations and statsmodels wrapper!

Code and explanations in this tutorial are from the [Yellowbrick Documentation](http://www.scikit-yb.org/en/develop/index.html). 

**Check it out!**