 <h2 align=center> Machine Learning Visualization Tools </h2>

### About the Dataset:

*Concrete Compressive Strength Dataset*

Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. 
- Number of instances: 1030
- Number of attributes: 9
- Attribute breakdown: 8 quantitative input variables and 1 quantitative output variable

The aim of the dataset is to predict the compressive strength of high-performance concrete (HPC). HPC does not always mean high strength; the term covers all kinds of concrete for special applications that are not possible with standard concretes. Therefore, our target value is:

**Target y**
- Concrete compressive strength [MPa]

In this case the compressive strength is the cylindrical compressive strength, meaning a cylindrical sample (15 cm diameter; 30 cm height) was used for testing. The value is a bit smaller than the one obtained from cubic samples. Both tests assess the uniaxial compressive strength. Usually, we get both values if we buy concrete.

To predict compressive strengths, we have these features available:

**Input X**:
- Cement $[\frac{kg}{m^3}]$
- Blast furnace slag $[\frac{kg}{m^3}]$
- Fly ash $[\frac{kg}{m^3}]$
- Water $[\frac{kg}{m^3}]$
- Plasticizer $[\frac{kg}{m^3}]$
- Coarse aggregate $[\frac{kg}{m^3}]$
- Fine aggregate $[\frac{kg}{m^3}]$
- Age $[d]$

### Importing Libraries

In [None]:
# Standard imports
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import warnings
import numpy as np
from pylab import rcParams
import seaborn as sns; sns.set(style="ticks", color_codes=True)
rcParams['figure.figsize'] = 15, 10

warnings.simplefilter('ignore')

### Dataset Exploration

In [None]:
# Load the data

df = pd.read_csv('../input/concrete.csv')
df.head()

In [None]:
df.describe()

### Preprocessing the Data

In [None]:
# Specify the features and target of interest
features = ["cement","slag","ash","water","splast","coarse","fine","age"]
target = 'strength'
# Get the X and y data from the DataFrame
X = df[features]
y = df[target]

### Pairwise Scatterplot

In [None]:
sns.pairplot(df);

### Feature Importances

###### The feature engineering process involves selecting the minimum required features to produce a valid model: the more features a model contains, the more complex it is (and the sparser the data), and therefore the more sensitive the model is to errors due to variance. A common approach to eliminating features is to describe their relative importance to a model, then eliminate weak features or combinations of features and re-evaluate to see whether the model fares better during cross-validation.

In [None]:
# Older Yellowbrick releases exposed this class as yellowbrick.features.importances.FeatureImportances
from yellowbrick.model_selection import FeatureImportances
from sklearn.linear_model import Lasso

# Create a new figure
fig = plt.figure()
ax = fig.add_subplot()

# Title case the feature for better display and create the visualizer
labels = list(map(lambda s: s.title(), features))
viz = FeatureImportances(Lasso(), ax=ax, labels=labels, relative=False)

# Fit and show the feature importances
viz.fit(X, y)
viz.poof()
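
The elimination step described above can be sketched without any fitted model. The snippet below is a minimal illustration, assuming we already have one coefficient per feature; the coefficient values and the 20% cutoff are hypothetical and are not results from the concrete dataset.

```python
import numpy as np

# Hypothetical coefficients for illustration only; these are NOT fitted values
# from the concrete dataset.
feature_names = ["cement", "slag", "ash", "water"]
coefs = np.array([0.12, 0.10, -0.02, -0.15])

# Rank features by the magnitude of their coefficient, largest first
order = np.argsort(np.abs(coefs))[::-1]
ranked = [(feature_names[i], float(coefs[i])) for i in order]

# Keep only features whose magnitude is at least 20% of the largest one
# (the 20% threshold is an arbitrary choice for this sketch)
relative = np.abs(coefs) / np.abs(coefs).max()
kept = [name for name, r in zip(feature_names, relative) if r >= 0.2]

print(ranked)
print(kept)
```

Keep in mind that coefficient magnitudes depend on the scale of each feature, so a ranking like this is most meaningful when the features have been standardized first.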

### Target Visualization

##### Frequently, machine learning problems in the real world suffer from the curse of dimensionality: you have fewer training instances than you’d like, and the predictive signal is distributed (often unpredictably!) across many different features.
##### Sometimes, when the target variable is continuously valued, there simply aren’t enough instances to predict these values to the precision of regression. In this case, we can sometimes transform the regression problem into a classification problem by binning the continuous values into makeshift classes.
##### To help the user select the optimal number of bins, the BalancedBinningReference visualizer takes the target variable y as input and generates a histogram with vertical lines indicating the recommended value points to ensure that the data is evenly distributed into each bin.

In [None]:
from yellowbrick.target import BalancedBinningReference

# Instantiate the visualizer
visualizer = BalancedBinningReference()

visualizer.fit(y)          # Fit the data to the visualizer
visualizer.poof()          # Draw/show/poof the data
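
To make the binning idea concrete, here is a minimal sketch that turns a continuous target into evenly populated classes. It uses quantiles on synthetic data, which is one way to get balanced bins and not necessarily how the visualizer computes its reference lines; the names `y_demo`, `n_bins`, and `classes` are made up for this example.

```python
import numpy as np

# Synthetic, right-skewed target values (not the concrete data)
rng = np.random.default_rng(0)
y_demo = rng.gamma(shape=2.0, scale=10.0, size=1000)

n_bins = 4
# Quantile-based edges put roughly the same number of samples in each bin
edges = np.quantile(y_demo, np.linspace(0.0, 1.0, n_bins + 1))

# Assign each value to a class 0..n_bins-1 using the interior edges
classes = np.digitize(y_demo, edges[1:-1])
counts = np.bincount(classes, minlength=n_bins)

print(edges)
print(counts)
```

The resulting `classes` array could then serve as the label vector for a classifier in place of the continuous target.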

### Evaluating Lasso Regression

##### A prediction error plot shows the actual targets from the dataset against the predicted values generated by our model. This allows us to see how much variance is in the model. Data scientists can diagnose regression models using this plot by comparing against the 45-degree line, where the prediction exactly matches the measured value.

In [None]:
from yellowbrick.regressor import PredictionError
from sklearn.model_selection import train_test_split

In [None]:
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [None]:
visualizer = PredictionError(Lasso(), size=(800, 600))
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)

# Call finalize to draw the final yellowbrick-specific elements
visualizer.finalize()

# Get access to the axes object and modify labels
visualizer.ax.set_xlabel("measured concrete strength")
visualizer.ax.set_ylabel("predicted concrete strength");
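
The score reported alongside such a plot is, for scikit-learn regressors by default, the coefficient of determination $R^2$. The sketch below computes it by hand to show what the 45-degree line means numerically; the helper `r2_score_manual` and the strength values are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up strengths in MPa, for illustration only
measured = np.array([20.0, 30.0, 40.0, 50.0])

# Predictions that lie exactly on the 45-degree line
perfect = r2_score_manual(measured, measured)

# A model that always predicts the mean of the measured values
baseline = r2_score_manual(measured, np.full(4, measured.mean()))

print(perfect, baseline)
```

Points on the identity line give a score of 1, while a model that only predicts the mean scores 0, so the scatter around the line is a visual counterpart of how far the score falls below 1.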

### Visualization of Test-set Errors

###### Using Yellowbrick, we can show the residuals (the difference between the predicted value and the truth) for both the training set and the test set (in blue and green, respectively).
###### A common use of the residuals plot is to analyze the variance of the error of the regressor. If the points are randomly dispersed around the horizontal axis, a linear regression model is usually appropriate for the data; otherwise, a non-linear model is more appropriate.

In [None]:
from yellowbrick.regressor import ResidualsPlot

visualizer = ResidualsPlot(Lasso(), size=(800,600))

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
g = visualizer.poof()             # Draw/show/poof the data
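
The following sketch illustrates what a non-random residual pattern looks like numerically. It fits a straight line to synthetic, purely quadratic data (not the concrete dataset) and uses the residual definition given above, predicted value minus truth; all variable names are made up for the example.

```python
import numpy as np

# Synthetic data with a purely quadratic relationship (not the concrete data)
x = np.linspace(-3.0, 3.0, 201)
y_quad = x ** 2

# Fit a straight line by least squares
slope, intercept = np.polyfit(x, y_quad, deg=1)
predicted = slope * x + intercept

# Residuals as defined in the text above: predicted value minus truth
residuals = predicted - y_quad

# The residuals average out to zero but still follow the curve the line missed
mean_residual = residuals.mean()
pattern_strength = abs(np.corrcoef(residuals, x ** 2)[0, 1])

print(mean_residual, pattern_strength)
```

A mean residual near zero alone says little; it is the strong remaining structure (here, an almost perfect correlation with $x^2$) that signals a linear model is missing something.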

### Cross Validation Scores

##### Generally we determine whether a given model is optimal by looking at its F1, precision, recall, and accuracy (for classification), or its coefficient of determination ($R^2$) and error (for regression). However, real-world data is often distributed somewhat unevenly, meaning that the fitted model is likely to perform better on some sections of the data than on others. Yellowbrick’s CVScores visualizer enables us to visually explore these variations in performance using different cross-validation strategies.
##### Cross-validation starts by shuffling the data (to prevent any unintentional ordering errors) and splitting it into $k$ folds. Then $k$ models are fit, each on $\frac{k-1}{k}$ of the data (called the training split) and evaluated on the remaining $\frac{1}{k}$ of the data (called the test split). The results from each evaluation are averaged together for a final score, then the final model is fit on the entire dataset for operationalization.

In [None]:
from sklearn.model_selection import KFold
from yellowbrick.model_selection import CVScores

# Create a new figure and axes
_, ax = plt.subplots()

cv = KFold(12)

oz = CVScores(
    Lasso(), ax=ax, cv=cv, scoring='r2', size=(800,500)
)

oz.fit(X_train, y_train)
oz.poof();
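
The k-fold procedure described above can be written out by hand in a few lines. This is a minimal sketch on synthetic data, using ordinary least squares as a stand-in for Lasso; the function `kfold_r2` and the demo arrays are assumptions made for illustration. Note that scikit-learn's `KFold` only shuffles when `shuffle=True` is passed, whereas this sketch always shuffles.

```python
import numpy as np

def kfold_r2(X, y, k=5, seed=0):
    """Shuffle, split into k folds, fit on k-1 folds, score on the held-out fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Ordinary least squares with an intercept column (stand-in for Lasso)
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        w, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        pred = A_test @ w
        ss_res = np.sum((y[test_idx] - pred) ** 2)
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores), folds

# Synthetic linear data for illustration
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=120)

scores, folds = kfold_r2(X_demo, y_demo, k=12)
mean_score = scores.mean()
print(scores.round(3), mean_score)
```

Each entry of `scores` corresponds to one held-out fold, and `mean_score` is the averaged final score mentioned in the text.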

### Learning Curves

##### A learning curve shows the relationship of the training score versus the cross validated test score for an estimator with a varying number of training samples. This visualization is typically used to show two things:
1. How much the estimator benefits from more data (e.g. do we have “enough data” or will the estimator get better if used in an online fashion).
2. If the estimator is more sensitive to error due to variance vs. error due to bias.

##### If the training and cross-validation scores converge as more data is added, then the model will probably not benefit from more data. If the training score is much greater than the validation score, then the model probably requires more training examples in order to generalize more effectively.

##### The curves are plotted with the mean scores; variability during cross-validation is shown by the shaded areas, which represent one standard deviation above and below the mean across all cross-validations. If the model suffers from error due to bias, then there will likely be more variability around the training score curve. If the model suffers from error due to variance, then there will be more variability around the cross-validated score.

In [None]:
from yellowbrick.model_selection import LearningCurve
from sklearn.linear_model import LassoCV
from pylab import rcParams
rcParams['figure.figsize'] = 15, 10

# Training set sizes to evaluate, as fractions of the available data
sizes = np.linspace(0.3, 1.0, 10)

# Create the learning curve visualizer, fit and poof
viz = LearningCurve(LassoCV(), train_sizes=sizes, scoring='r2')
viz.fit(X, y)
viz.poof()
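
To show what a learning curve measures, the sketch below computes training and validation scores for growing training subsets. It is a simplification under stated assumptions: the data is synthetic, ordinary least squares stands in for the Lasso family, and a single fixed validation split replaces the cross-validation used by the visualizer. The helpers `fit_ols` and `r2` are hypothetical names.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares weights with an intercept (stand-in for the Lasso family)."""
    A = np.column_stack([np.ones(len(y)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def r2(X, y, w):
    """Coefficient of determination of the linear model w on (X, y)."""
    pred = np.column_stack([np.ones(len(y)), X]) @ w
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# Synthetic data: 8 features, as in the dataset, but made-up values and weights
rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0, 0.5, 1.0, 0.0, -1.0, 2.0, 0.3])
X_all = rng.normal(size=(500, 8))
y_all = X_all @ true_w + rng.normal(scale=0.5, size=500)

# Hold out a fixed validation set; train on growing fractions of the rest
X_tr, y_tr, X_val, y_val = X_all[:400], y_all[:400], X_all[400:], y_all[400:]
fractions = np.linspace(0.3, 1.0, 10)

train_scores, val_scores = [], []
for frac in fractions:
    n = int(frac * len(y_tr))
    w = fit_ols(X_tr[:n], y_tr[:n])
    train_scores.append(r2(X_tr[:n], y_tr[:n], w))
    val_scores.append(r2(X_val, y_val, w))

for frac, tr, va in zip(fractions, train_scores, val_scores):
    print(f"{frac:.2f}  train={tr:.3f}  val={va:.3f}")
```

Plotting `train_scores` and `val_scores` against `fractions` would give a rough, single-split version of the curves drawn by the visualizer.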

### Hyperparameter Tuning


The `AlphaSelection` Visualizer demonstrates how different values of alpha influence model selection during the regularization of linear models.

In [None]:
from yellowbrick.regressor import AlphaSelection

In [None]:
alphas = np.logspace(-10,1,400)

model = LassoCV(alphas=alphas)
visualizer = AlphaSelection(model, size=(800,600))

visualizer.fit(X,y)
visualizer.poof();
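
To build intuition for what alpha does, consider the special case of orthonormal features and the objective $\frac{1}{2}\lVert y - Xw \rVert^2 + \alpha \lVert w \rVert_1$. There, the Lasso solution is the least-squares solution with every coefficient shrunk toward zero by alpha (soft-thresholding). The sketch below assumes that special case; the coefficients are hypothetical, and because scikit-learn scales the squared loss by the number of samples, these alpha values are not directly comparable to the ones passed to `LassoCV`.

```python
import numpy as np

def soft_threshold(w, alpha):
    """Shrink each coefficient toward zero by alpha, clipping at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - alpha, 0.0)

# Hypothetical unregularized coefficients, for illustration only
w_ols = np.array([3.0, -1.5, 0.5, 0.1])

for alpha in [0.0, 0.2, 1.0, 2.0, 5.0]:
    w_lasso = soft_threshold(w_ols, alpha)
    print(f"alpha={alpha:<4} nonzero={np.count_nonzero(w_lasso)} coefs={w_lasso}")
```

As alpha grows, more coefficients are set exactly to zero, which is why choosing alpha amounts to choosing how many features the model effectively uses.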