# End-To-End Machine Learning Project

This notebook demonstrates an end-to-end Machine Learning project: building a model of housing prices for a real estate company in California. It walks through the following steps:
1. Frame the problem.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms. 
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system.

## Frame the Problem

In this phase we should seek answers to the following questions:
1. What exactly is the business objective? 
  * To feed the model's output into another Machine Learning system.
2. Are there any previous solutions, and if so, how accurate are they and how were they built?
  * Prices are currently estimated manually by market experts using complex rules.
  * The current error rate is about 15%.
3. How does the company expect to use and benefit from this model?
  * Decrease the cost, time and error of the current system.
4. What data sources are currently available?
  * California Census Data, containing metrics such as the population, median income, median housing price, and so on for each block group in California.
5. Is it supervised, unsupervised or reinforcement learning?
  * Supervised learning, since the data contains the labels.
6. What kind of task is it?
  * A multiple regression task, since the model needs to predict a continuous value from multiple features. It is also univariate, since a single value is predicted for each block group.
7. Should we use batch learning or online learning techniques?
  * Batch Learning since there is no continuous flow of data coming in the system, there is no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory.

## Select a Performance Measure

A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It is the square root of the average squared prediction error, so it penalizes large errors more heavily; when the errors average to zero, it equals their standard deviation. The mathematical formula to calculate the RMSE is the following:

$RMSE(X,f) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(f(x^{(i)}) - y^{(i)}\right)^2}$

Where:

$X$: a matrix with $n$ rows (instances) and $m$ columns (features).

$f$: the predicting function or model.

$x^{(i)}$: row $i$ of the matrix $X$.

$f(x^{(i)})$: the predicted value for the point $x^{(i)}$ in the $m$-dimensional space.

$y^{(i)}$: the real value (label) for row $i$ of the matrix $X$.
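
As a quick sanity check of the formula, the sketch below computes the RMSE by hand with NumPy. The arrays `y_true` and `y_pred` are made-up values for illustration, not taken from the housing data.

```python
import numpy as np

# Made-up labels y^(i) and predictions f(x^(i)), for illustration only
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 7.0])

# Errors f(x^(i)) - y^(i)
errors = y_pred - y_true
# Square the errors, average over the n instances, then take the square root
rmse = np.sqrt(np.mean(errors ** 2))
print(rmse)
```

The squared errors here are 1, 0, 4 and 0, so the mean squared error is 1.25 and the RMSE is its square root.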



## Verify the Assumptions

Our approach assumes that the Machine Learning system fed by our model expects a continuous quantitative value, which is why we classified the task as a regression. What if it actually expects a range or interval instead of a value? In that case our task would have been a classification rather than a regression. Let's suppose that our assumption was confirmed.

## Preparing the environment

In [None]:
from tarfile import TarFile
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, cross_val_score, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelBinarizer, StandardScaler, label_binarize
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
import random
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.precision', 3)
pd.options.display.float_format = "{:,.3f}".format

## Getting the Data

In [None]:
DATA_DIR="/data"
FILE_NAME='housing.tgz'

def fetch_data(data_file):
    """Read the first regular file of a tar archive into a pandas Data Frame."""
    with TarFile.open(data_file) as archive:
        # Use the public getmembers() API instead of popping from the members list
        member = next(m for m in archive.getmembers() if m.isfile())
        data = pd.read_csv(archive.extractfile(member))
    return data

data=fetch_data(os.path.join(DATA_DIR,FILE_NAME))

## Understanding the Data

### First 10 observations

In [None]:
data.head(10)

By looking at the first 10 observations it seems that all features are numeric with the exception of ocean_proximity.

### Features data types and counts

In [None]:
data.info()

From these results we can observe the following:
* The dataset contains a total of 20,640 observations and 10 features
* All features are numeric (float) except ocean_proximity
* The only feature with missing values is total_bedrooms

### Quantitative features

#### Descriptive statistics

In [None]:
data.describe()

The describe function of the Data Frame object provides several statistics including the count (frequency), min, max, mean, standard deviation (std) and the first, second and third quartiles (25%, 50% and 75% percentiles respectively). From these results we can observe the following:
* The median_income feature was scaled.
* The housing_median_age and median_house_value features were capped.
* The quartiles of these features seem to indicate that their distributions are not normal but rather have heavy tails.
* All the features have different scales.

#### Distribution

In [None]:
data.hist(bins=50, figsize=(20,15))
plt.show()

This visualization confirms the heavy right tails (right skew) in the feature distributions and that the median_house_value and housing_median_age features were capped.
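
Capping can also be checked numerically: a capped feature has an unusually large share of rows sitting exactly at its maximum. The sketch below illustrates the idea on synthetic data; the cap of 500,000 and the distribution parameters are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed values clipped at a hypothetical cap of 500,000
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=12.2, sigma=0.6, size=5_000)).clip(upper=500_000)

# Share of observations that sit exactly at the maximum
share_at_max = (values == values.max()).mean()
print(f"max = {values.max():,.0f}, share at max = {share_at_max:.1%}")
```

For an uncapped continuous feature this share would be close to 1/n. The same expression could be applied to a column of the housing data, for example data.median_house_value.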

### Qualitative features

#### Frequency

In [None]:
counts=data.ocean_proximity.value_counts()
pd.DataFrame({'Labels':counts.index,
              'Frequency':counts.values,
              "Relative Frequency":np.divide(counts.values,data.shape[0])
             })

## Split the data into Train and Test Datasets

In [None]:
train, test = train_test_split(data,test_size=0.2,random_state=100)
train_set = train.copy()
test_set = test.copy()

### Verifying Sampling Bias

In [None]:
fig = plt.figure(figsize=(15,5))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
ax1.hist(data.median_house_value,bins=50,label='Data Set')
ax1.hist(train_set.median_house_value,bins=50,label='Train Set')
ax2.hist(data.median_income,bins=50,label='Data Set')
ax2.hist(train_set.median_income,bins=50,label='Train Set')
ax1.legend(loc='best')
ax2.legend()
ax1.set_title('Median House Value')
ax2.set_title('Median Income Value')
plt.show()

We can see in the previous visualization that both variables are distributed similarly in the training set and in the full data set. This indicates that our training set is representative of our full dataset. However, let's also implement stratified sampling by median income and determine which of the two sampling methodologies has the lower sampling bias.

First, we need to discretize the median income feature to create categories based on quantiles.

In [None]:
data.loc[:,'median_income_cat']=pd.qcut(data.median_income,10)
data.median_income_cat.value_counts()

Now let's perform the stratified sampling based on the median income categories.

In [None]:
split = StratifiedShuffleSplit(n_splits=1,test_size=0.20,random_state=100)
for train_index, test_index in split.split(data,data.median_income_cat):
    strat_train_set = data.iloc[train_index]
    strat_test_set = data.iloc[test_index]

Finally, let's compare the sampling error for both the simple random sample and the stratified sample. To do this we first need to create the same median income categories for our simple random sample.

In [None]:
train_set.loc[:,'median_income_cat']=pd.cut(train_set.median_income,data.median_income_cat.values.categories)
sampling_error=pd.DataFrame({'Full_Dataset':data.median_income_cat.value_counts()/len(data),\
                             'Random_Sample':train_set.median_income_cat.value_counts()/len(train_set),\
                             'Strat_Sample':strat_train_set.median_income_cat.value_counts()/len(strat_train_set)})
sampling_error.loc[:,'Random_Error(%)']=np.divide(np.subtract(sampling_error.Random_Sample,sampling_error.Full_Dataset),
                                                  sampling_error.Full_Dataset)*100
sampling_error.loc[:,'Strat_Error(%)']=np.divide(np.subtract(sampling_error.Strat_Sample,sampling_error.Full_Dataset),
                                                 sampling_error.Full_Dataset)*100
sampling_error

The sampling errors suggest that the stratified sample has a smaller bias than the random sample. Therefore, we will continue working with the stratified training and testing datasets. Before moving on, let's remove the recently added median income category feature from all the datasets we added it to.

In [None]:
# Avoid shadowing the built-in name `set`
for dataset in [data,train_set,strat_test_set,strat_train_set]:
    del dataset['median_income_cat']

## Discover and Visualize the Training Data to Gain Insights

In [None]:
strat_train_set_copy = strat_train_set.copy()
strat_train_set_copy.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, s=strat_train_set_copy["population"]/100, \
                    figsize=(15,10),label="population", c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
) 
plt.legend()
plt.show()

This image tells you that the housing prices are very much related to the location (e.g., close to the ocean) and to the population density.

### Looking for Correlation

In [None]:
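# Correlations are only defined for numeric columns, so exclude ocean_proximity explicitly
corr_matrix = strat_train_set_copy.select_dtypes(include='number').corr()
corr_matrix.median_house_value.sort_values(ascending=False)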

The most promising attribute to predict the median house value is the median income, so let’s zoom in on their correlation scatterplot.

In [None]:
strat_train_set_copy.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1,figsize=(8,8))

This plot reveals the following: 
1. The correlation is indeed very strong
2. Besides the cap at 500,000 dollars that we noticed earlier, there are other, less obvious horizontal lines: for instance, one around 450,000 and another around 350,000.

We might need to consider removing the corresponding block groups to prevent learning algorithms from learning to reproduce these data quirks.
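
To locate such horizontal lines numerically rather than by eye, one option is to count how often each exact target value occurs. The sketch below does this on synthetic prices; the repeated values and the threshold of 20 occurrences are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic prices: mostly continuous, plus two artificially repeated values
prices = pd.Series(np.concatenate([
    rng.uniform(100_000, 500_000, size=1_000).round(0),
    np.full(40, 450_000.0),   # hypothetical data quirk
    np.full(60, 500_000.0),   # hypothetical cap
]))

# Values that occur far more often than a continuous variable would allow
counts = prices.value_counts()
suspicious = counts[counts >= 20]
print(suspicious)

# Keep only the rows whose value is not one of the suspicious ones
cleaned = prices[~prices.isin(suspicious.index)]
print(len(prices), len(cleaned))
```

Whether dropping these rows actually helps would still need to be checked with cross-validation.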

## Feature Engineering

Now let's construct new features out of the existing ones and look at their correlation with the median house value feature.

In [None]:
strat_train_set_copy["rooms_per_household"] = strat_train_set_copy["total_rooms"]/strat_train_set_copy["households"]
strat_train_set_copy["bedrooms_per_room"] = strat_train_set_copy["total_bedrooms"]/strat_train_set_copy["total_rooms"]
strat_train_set_copy["population_per_household"]=strat_train_set_copy["population"]/strat_train_set_copy["households"]

In [None]:
strat_train_set_copy.select_dtypes(include='number').corr().median_house_value.sort_values(ascending=False)

The new bedrooms_per_room attribute is more strongly correlated with the median_house_value feature than the total_bedrooms or total_rooms features. Since its correlation is negative, houses with a lower bedroom/room ratio tend to be more expensive. The rooms_per_household feature is also more informative than the total_rooms feature.

## Prepare the Data for Machine Learning Algorithms

In [None]:
X_DF = strat_train_set.drop('median_house_value',axis=1)
Y = strat_train_set['median_house_value'].copy()

### Data Cleaning

Earlier we noticed that the feature total_bedrooms includes missing values. Let's see how many observations in the training set have this problem.

In [None]:
print('total_bedrooms missing values: {0}'.format(X_DF.total_bedrooms.isnull().sum()))

Most Machine Learning algorithms cannot work with features with missing values, so we need to handle them before training the algorithms. To handle these missing values, we have the following three options:
1. Removing the instances with the missing values.
2. Removing the entire feature with missing values from the training and test sets.
3. Impute the missing values.
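
The three options can be sketched with plain pandas on a toy frame. The values below are made up; only the column names mirror the housing data.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value, for illustration only
toy = pd.DataFrame({"total_bedrooms": [2.0, np.nan, 4.0, 10.0],
                    "households": [1.0, 2.0, 3.0, 4.0]})

option_1 = toy.dropna(subset=["total_bedrooms"])    # 1. drop the affected rows
option_2 = toy.drop(columns=["total_bedrooms"])     # 2. drop the whole feature
median = toy["total_bedrooms"].median()             # median of 2, 4 and 10 is 4
option_3 = toy.fillna({"total_bedrooms": median})   # 3. impute with the median

print(option_3)
```

With imputation, the median computed on the training set should be kept so that the same value can be used for the test set and for new data.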

Let's impute the missing values of all features using the median of the respective feature, in the following steps:
1. Create a data frame with only the quantitative features by dropping the ocean_proximity feature
2. Use the SimpleImputer class with the strategy set to median to calculate the median of all the quantitative features
3. Use the transform method of this class to replace the missing values of the features with the corresponding median value

In [None]:
# Start from X_DF so that the label (median_house_value) is not treated as a feature
X_DF_Quant = X_DF.drop('ocean_proximity',axis=1)
imputer = SimpleImputer(strategy='median')
imputer.fit(X_DF_Quant)
X_DF_Quant = pd.DataFrame(imputer.transform(X_DF_Quant),columns=X_DF_Quant.columns,index=X_DF_Quant.index)
print('total_bedrooms missing values: {0}'.format(X_DF_Quant.total_bedrooms.isnull().sum()))

### Handling Text and Categorical Attributes

Most Machine Learning algorithms operate on a matrix of numbers. Currently our dataset contains one non-numeric feature, so we need to find a numeric representation for it. The LabelBinarizer class does exactly that by representing the feature as a matrix. The columns of this matrix represent the different values of the feature and the rows represent the observations. The value observed in a particular observation gets encoded as a 1 and all other entries in that row get encoded as 0. This is called **one-hot encoding**.
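
To make the encoding concrete before applying it to the housing data, here is a minimal sketch on a handful of made-up category values (the names `toy_categories`, `toy_encoder` and `toy_encoded` are illustrative).

```python
from sklearn.preprocessing import LabelBinarizer

# Toy category values, for illustration only
toy_categories = ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"]

toy_encoder = LabelBinarizer()
toy_encoded = toy_encoder.fit_transform(toy_categories)
print(toy_encoder.classes_)   # one column per class, in sorted order
print(toy_encoded)            # one row per observation, a single 1 per row
```

LabelBinarizer is documented as a tool for target labels. For input features Scikit-Learn also offers the OneHotEncoder class, which could be used here instead.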

In [None]:
encoder = LabelBinarizer()
ocean_prox = strat_train_set.ocean_proximity.copy()
ocean_prox_encoded = encoder.fit_transform(ocean_prox)
ocean_prox_encoded

### Custom Transformations

To create our own transformers so that they work seamlessly with other Scikit-Learn functionalities (such as pipelines), we need to create a class that implements the following 3 methods:
1. <u>fit</u>: learns any parameters from the data and returns self
2. <u>transform</u>: applies the learned transformation
3. <u>fit_transform</u>: learns and applies the transformation. We can inherit this method by extending the TransformerMixin class. If we also extend BaseEstimator and avoid \*args and \*\*kwargs in the constructor, we get additional methods such as get_params and set_params.
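
As a minimal sketch of this contract, the hypothetical transformer below (the class `ColumnShifter` and its `offset` parameter are invented for the example) implements only fit and transform and inherits the rest.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnShifter(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the per-column minimum learned in fit."""

    def __init__(self, offset=0.0):
        self.offset = offset           # explicit keyword argument, no *args/**kwargs

    def fit(self, X, y=None):
        self.min_ = np.min(X, axis=0)  # learned state gets a trailing underscore
        return self

    def transform(self, X):
        return X - self.min_ + self.offset

X_toy = np.array([[1.0, 10.0], [3.0, 30.0]])
shifter = ColumnShifter(offset=1.0)
print(shifter.fit_transform(X_toy))    # inherited from TransformerMixin
print(shifter.get_params())            # inherited from BaseEstimator
```

Keeping the constructor limited to storing its arguments is what lets get_params and set_params discover the hyperparameters automatically.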

To illustrate this let's create a custom transformer to add the additional features we calculated previously.

In [None]:
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    
    def __init__(self, add_bedrooms_per_room = True):
        
        self.add_bedrooms_per_room = add_bedrooms_per_room
        self.rooms_ix = pd.Index(strat_train_set.columns).get_loc('total_rooms')
        self.bedrooms_ix = pd.Index(strat_train_set.columns).get_loc('total_bedrooms')
        self.population_ix = pd.Index(strat_train_set.columns).get_loc('population')
        self.household_ix = pd.Index(strat_train_set.columns).get_loc('households')
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        rooms_per_household = X[:,self.rooms_ix] / X[:,self.household_ix] 
        population_per_household = X[:,self.population_ix] / X[:,self.household_ix] 
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:,self.bedrooms_ix] / X[:,self.rooms_ix]
            return np.c_[X,rooms_per_household,population_per_household,bedrooms_per_room]
        else:
            return np.c_[X,rooms_per_household,population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False) 
X_Quant_Extra_Attr = attr_adder.transform(X_DF_Quant.values)
X_Quant_Extra_Attr

### Feature Scaling

Feature scaling is one of the most important transformations that need to be applied to the data prior to training any learning algorithm. The reason is that most Machine Learning algorithms don't perform well with quantitative attributes that have very different scales. The two most common methods to scale attributes are the following:
1. Min-Max Scaling (Normalization): The attributes are scaled to fall in the interval [0,1]. The mathematical formula to scale the value i of the attribute X is given by $\frac{X_i - min(X)}{max(X) - min(X)}$
2. Standardization: Scale the attribute so that the mean is equal to 0 and the variance is equal to 1. The mathematical formula to scale the value i of the attribute X is given by $\frac{X_i - mean(X)}{std(X)}$
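
Both formulas can be checked by hand with NumPy on a small made-up attribute (the array `x` is illustrative).

```python
import numpy as np

# Toy attribute values, for illustration only
x = np.array([2.0, 4.0, 6.0, 8.0])

# Min-max scaling: (x - min) / (max - min), the result lies in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, the result has mean 0 and variance 1
x_std = (x - x.mean()) / x.std()

print(x_minmax)
print(x_std.mean(), x_std.var())
```

Note that `x.std()` is the population standard deviation (dividing by n), which is also what StandardScaler uses.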

In [None]:
std_scaler = StandardScaler()
X_Quant_Extra_Attr_Scaled = std_scaler.fit_transform(X_Quant_Extra_Attr)
print(np.apply_along_axis(np.mean,0,X_Quant_Extra_Attr_Scaled))
print(np.apply_along_axis(np.std,0,X_Quant_Extra_Attr_Scaled))

### Transformation Pipelines

So far we have applied the following transformations to our training set:
1. Separated the quantitative and qualitative attributes.
2. Imputed missing values.
3. Added additional attributes.
4. Scaled the quantitative attributes.

To apply these transformations in order and in a more automated way, we can use the transformation pipelines available in Scikit-Learn.

In [None]:
class AttributeSelector(BaseEstimator, TransformerMixin):
    
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return X[self.attribute_names].values

class MyLabelBinarizer(BaseEstimator, TransformerMixin):
    
    def __init__(self,classes):
        # Store the parameter under its own name so that get_params() and clone() work
        self.classes = classes
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, x, y=None):
        return label_binarize(x,classes=self.classes)

In [None]:
quant_pipeline = Pipeline([('selector',AttributeSelector(list(X_DF.select_dtypes(exclude=['object','category'])))),
                           ('imputer',SimpleImputer(strategy='median')),
                           ('attr_adder',CombinedAttributesAdder()),
                           ('std_scaler',StandardScaler())])

qual_pipeline = Pipeline([('selector',AttributeSelector(list(X_DF.select_dtypes(include=['object','category'])))),
                           ('label_binarizer',MyLabelBinarizer(X_DF.ocean_proximity.unique()))])

data_prep_pipeline = FeatureUnion(transformer_list=[('quant_pipeline',quant_pipeline),
                                               ('qual_pipeline',qual_pipeline)])
X_Prepared = data_prep_pipeline.fit_transform(X_DF)
print(X_Prepared.shape)

## Training and Evaluating using the Training Set

### Training a Linear Regression

In [None]:
linear_reg = LinearRegression()
linear_reg.fit(X_Prepared, Y)

### Making Predictions

#### Linear Regression

The following script selects 5 random observations from the training set and uses them to make predictions. The predictions are printed along with the real values.

In [None]:
small_index = random.sample(list(X_DF.index),5)
small_sample = X_DF.loc[small_index,:]
# Reuse the pipeline fitted on the full training set; refitting on 5 rows would change the imputer and scaler statistics
small_sample_prepared = data_prep_pipeline.transform(small_sample)
print('Prediction: {0}'.format(linear_reg.predict(small_sample_prepared)))
print('Real Values: {0}'.format(list(Y.loc[small_index])))

Now let's make the predictions using the entire training set and calculate the training error.

In [None]:
linear_pred = linear_reg.predict(X_Prepared)
linear_rmse = np.sqrt(mean_squared_error(linear_pred,Y))
print('Linear model training error: {0}'.format(linear_rmse))

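The training error of the Linear Regression model is approximately \$68,247. This is not satisfactory: it is a large error for house values that are capped at \$500,000, and an error this high on the training set itself suggests that the model is underfitting the data.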

#### Decision Tree

In [None]:
tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_Prepared,Y)
tree_pred = tree_reg.predict(X_Prepared)
tree_rmse = np.sqrt(mean_squared_error(tree_pred,Y))
print('Decision Tree training error: {0}'.format(tree_rmse))

Can this be the best model, since it has an RMSE of 0? It is far more likely that the model has badly overfit the data. How can we verify this? We don't want to touch the test set until we have identified a model that we are confident about, so we need to use part of the training set for training and part for model validation.

### K-fold Cross Validation

Scikit-Learn's cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better). For this reason the scoring parameter is set to the negative mean squared error, and the sign is flipped back before taking the square root.
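
The sign convention can be seen in isolation in the sketch below. It uses synthetic regression data from `make_regression` as a stand-in for the prepared housing features; the sample sizes and noise level are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, standing in for the prepared housing features
X_toy, y_toy = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# Each score is a negated MSE, so every entry is <= 0
neg_mse = cross_val_score(LinearRegression(), X_toy, y_toy,
                          scoring="neg_mean_squared_error", cv=4)

# Flip the sign before taking the square root to get one RMSE per fold
rmse_per_fold = np.sqrt(-neg_mse)
print(neg_mse)
print(rmse_per_fold)
```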

The following class trains and cross-validates a Machine Learning model and stores the results in an object.

In [None]:
class ML_Model():
    
    def __init__(self,X,Y,model,folds):
        self.model_reg_ = model.fit(X,Y)
        self.model_pred = model.predict(X)
        self.model_rmse = np.sqrt(mean_squared_error(self.model_pred,Y))
        self.model_cross_scores = cross_val_score(self.model_reg_,X,Y,scoring='neg_mean_squared_error',cv=folds)
        self.model_cross_rmse = np.sqrt(-self.model_cross_scores)
        self.model_stats=pd.Series(self.model_cross_rmse).describe()[1:]
        # Series.append was removed in pandas 2.0; pd.concat works across versions
        self.model_stats=pd.concat([pd.Series({'training_error':self.model_rmse}),self.model_stats])

The following script uses the previous class to train and cross-validate the following 4 models:
1. Linear Regression
2. Decision Tree
3. Support Vector Machine with a linear kernel
4. Random Forest

The statistics of these models are then combined into a pandas Data Frame object. The following code block will take several minutes, since it trains 4 models and validates each of them using 4-fold cross-validation.

In [None]:
decision_tree_model = ML_Model(X_Prepared,Y,tree_reg,4)
linear_model = ML_Model(X_Prepared,Y,linear_reg,4)
svm_model = ML_Model(X_Prepared,Y,SVR(kernel='linear'),4)
random_forest_model = ML_Model(X_Prepared,Y,RandomForestRegressor(random_state=42),4)
pd.DataFrame({'Linear Regression':linear_model.model_stats,
              'Decision Tree':decision_tree_model.model_stats,
              'SVM':svm_model.model_stats,
              'Random Forest':random_forest_model.model_stats
             })

From these results we can observe the following facts:
* The Decision Tree's mean cross-validation error is far from its training error. This indicates that the model is heavily overfitting the training data.
* The Linear model performed better than the Decision Tree according to the cross-validation results.
* The Random Forest seems very promising, since its cross-validation error is smaller than that of all other models. However, this model is still overfitting the training set, since the training error is lower than the cross-validation error. Possible solutions are to simplify the model, constrain it (regularization), or get more training data.

### Hyperparameters

GridSearchCV evaluates all the possible combinations of the hyperparameter values we want to experiment with and uses cross-validation to determine the best one. The following code searches for the best combination of hyperparameter values for the Random Forest Regressor. When you have no idea what value a hyperparameter should have, a simple approach is to try out consecutive powers of 10, or a smaller base for a more fine-grained search.

In [None]:
param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [10, 20, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [10, 20], 'max_features': [2, 4, 6]},
  ]

# train across 5 folds, that's a total of (12+6)*5=90 rounds of training 
grid_search = GridSearchCV(random_forest_model.model_reg_, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(X_Prepared, Y)

This grid search explored 18 combinations of Random Forest Regressor hyperparameter values and trained each model five times. In other words, we performed 18 × 5 = 90 rounds of training. The following code obtains the combination of hyperparameters with the lowest cross validation estimated error.

In [None]:
grid_search.best_estimator_

The cross-validation scores are available in the cv_results_ attribute, as the following code illustrates:

In [None]:
gridsearch_cv_results=grid_search.cv_results_
for mean_score, params in zip(gridsearch_cv_results['mean_test_score'],gridsearch_cv_results['params']):
    print(np.sqrt(-mean_score), params)

The grid search approach is fine when you are exploring relatively few combinations, like in the previous example, but when the hyperparameter search space is large, it is often preferable to use **RandomizedSearchCV** instead. This class can be used in much the same way as the GridSearchCV class, but instead of trying out all possible combinations, it evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration. This approach has two main benefits:
1. If you let the randomized search run for, say, 1,000 iterations, this approach will explore 1,000 different values for each hyperparameter (instead of just a few values per hyperparameter with the grid search approach).
2. You have more control over the computing cost you want to allocate to hyperparameter search, simply by setting the number of iterations.
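
A minimal sketch of such a randomized search is shown below. It runs on synthetic data rather than the housing features, and the sampling ranges, the number of iterations and the number of folds are assumptions chosen to keep the example fast.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, standing in for the prepared housing features
X_toy, y_toy = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Distributions to sample from (the ranges are illustrative assumptions)
param_distributions = {
    "n_estimators": randint(10, 50),   # integers in [10, 50)
    "max_features": randint(2, 9),     # integers in [2, 9)
}

random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=5,                          # number of sampled combinations
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
random_search.fit(X_toy, y_toy)
print(random_search.best_params_)
print(np.sqrt(-random_search.best_score_))
```

The computing budget is set directly by `n_iter`: this run trains 5 × 3 = 15 models, however wide the sampling ranges are.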

## Analyze the best models and their errors

### Exploring the feature weight on making predictions

In [None]:
quantitative_features=list(X_DF.drop(['ocean_proximity'],axis=1).columns)
extra_features=['rooms_per_household','population_per_household','bedrooms_per_room']
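# Use the class order that was passed to MyLabelBinarizer, which determines the column order in X_Prepared
qualitative_features=list(X_DF.ocean_proximity.unique())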
all_features=quantitative_features+extra_features+qualitative_features
feature_weights=grid_search.best_estimator_.feature_importances_
sorted(zip(feature_weights, all_features), reverse=True)

Looking at these results, we may want to try dropping some of the less important features. For instance, only one ocean_proximity category seems useful (INLAND), so we could consider dropping the others.
We should also look at the specific errors that the model makes, then try to understand why it makes them and what we could do to fix them, for example by adding extra features, getting rid of uninformative ones, or dealing with outliers. The following class is used to select the top k features with the highest weights.
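
The selection in that class relies on `np.argpartition`. The sketch below shows the idea on made-up importance scores (the array `weights` and the value of `k` are illustrative).

```python
import numpy as np

# Made-up importance scores for five features
weights = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
k = 3

# argpartition places the indices of the k largest values in the last k slots
# (in no particular order); sorting them restores the original column order
top_k_indices = np.sort(np.argpartition(weights, -k)[-k:])
print(top_k_indices)  # [1 3 4]
```

Sorting the indices keeps the selected columns in their original order, so the feature names can still be matched to the columns afterwards.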

In [None]:
class TopKFeatureSelector(BaseEstimator, TransformerMixin):
    
    def top_k_feature_index(self,arr, k):
        return np.sort(np.argpartition(np.array(arr), -k)[-k:])
    
    def __init__(self, feature_weights, k):
        self.feature_weights = feature_weights
        self.k = k
        
    def fit(self, X, y=None):
        self.feature_indices_ = self.top_k_feature_index(self.feature_weights, self.k)
        return self
    
    def transform(self, X):
        return X[:, self.feature_indices_]

In [None]:
full_pipeline = Pipeline([
    ('preparation', data_prep_pipeline),
    ('feature_selection', TopKFeatureSelector(feature_weights, 8))
])
X_Prepared=full_pipeline.fit_transform(X_DF)

In [None]:
ML_Model(X_Prepared,Y,RandomForestRegressor(random_state=42),4).model_stats

These results show that the Random Forest model performs slightly better when the less important features are removed.

## Evaluate Your System on the Test Set

The model with the lowest error was the Random Forest with tuned hyperparameters. Therefore, let's evaluate this model using the test set.

In [None]:
final_model = grid_search.best_estimator_
X_Test = strat_test_set.drop("median_house_value", axis=1) 
Y_Test = strat_test_set["median_house_value"].copy()
X_Test_Prepared = data_prep_pipeline.transform(X_Test)
final_predictions = final_model.predict(X_Test_Prepared)
final_mse = mean_squared_error(Y_Test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse

This performance is slightly worse than what we measured using cross-validation. That is usually the case in practice, because the hyperparameters were tuned to perform well on the validation folds.

## Launch, Monitor, and Maintain Your System

Our model is ready for production. We can deploy it by following these steps:
1. Plug in the production input data sources and write tests. 
2. Write monitoring code to check your system’s live performance at regular time intervals and trigger alerts when there are performance decays. Evaluating your system’s performance will require sampling the system’s predictions and evaluating them. This will generally require a human analysis. These analysts may be field experts, or workers on a crowdsourcing platform (such as Amazon Mechanical Turk or CrowdFlower). Either way, you need to plug the human evaluation pipeline into your system.
3. Evaluate the system’s input data quality. Sometimes performance will degrade slightly because of a poor-quality signal (e.g., a malfunctioning sensor sending random values, or another team’s output becoming stale), but it may take a while before your system’s performance degrades enough to trigger an alert. If you monitor your system’s inputs, you may catch this earlier. Monitoring the inputs is particularly important for online learning systems.
4. Retrain your models on a regular basis using fresh data. You should automate this process as much as possible. If you don’t, you are very likely to refresh your model only every six months (at best), and your system’s performance may fluctuate severely over time. If your system is an online learning system, you should make sure you save snapshots of its state at regular intervals so you can easily roll back to a previously working state.
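
As a rough sketch of the monitoring step, the hypothetical helper below compares the RMSE of a labeled sample of live predictions against an alert threshold. The function name, the threshold and the sample values are all assumptions; how the alert is delivered is left out.

```python
import numpy as np

def check_live_rmse(y_true, y_pred, alert_threshold):
    """Return (rmse, alert) for a labeled sample of live predictions.

    `alert_threshold` is a hypothetical value chosen by the team,
    for example slightly above the RMSE measured on the test set.
    """
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    return rmse, rmse > alert_threshold

# Made-up sample of expert-verified prices and the model's predictions
rmse, alert = check_live_rmse([200_000, 310_000], [190_000, 350_000],
                              alert_threshold=25_000)
print(rmse, alert)
```

In this made-up sample the errors are -10,000 and 40,000, which gives an RMSE of roughly 29,155 and therefore triggers the alert.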