## Tutorial Overview

In [1]:
# 1. Dataset Size Sensitivity Analysis
# 2. Synthetic Prediction Task and Baseline Model
# 3. Sensitivity Analysis of Dataset Size

In [2]:
# How Much Training Data Is Required for Machine Learning?

# There is a strong relationship between training dataset size and model performance,
# especially for nonlinear models. The relationship often involves an improvement
# in performance up to a point and a general reduction in the expected variance of the model
# as the dataset size is increased.

# Knowing this relationship for your model and dataset can be helpful for a number of reasons,
# such as:
# Evaluate more models.
# Find a better model.
# Decide to gather more data.

In [3]:
# The scikit-learn make_classification() function can be used to create a synthetic classification dataset.
# In this case, we will use 20 input features (columns) and generate 1,000 samples (rows).


# test classification dataset
from sklearn.datasets import make_classification

# define dataset
x, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5,
                           random_state=1)
# summarize the dataset
print(x.shape, y.shape)

(1000, 20) (1000,)
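
Because the evaluation later in this notebook uses stratified folds, it can be useful to look at how the samples are split between the classes. The snippet below is a minimal sketch of one way to check this with `numpy.unique`; the variable names `classes` and `counts` are illustrative and not part of the original tutorial.

```python
import numpy as np
from sklearn.datasets import make_classification

# same dataset definition as above, repeated so the snippet runs on its own
x, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5,
                           random_state=1)

# count how many rows belong to each class label
classes, counts = np.unique(y, return_counts=True)
for label, count in zip(classes, counts):
    print("class %d: %d rows (%.1f%%)" % (label, count, 100.0 * count / len(y)))
```

With the default settings, `make_classification()` generates two classes of roughly equal size, so the two percentages should be close to 50%.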


In [7]:
# Evaluate a decision tree model on the synthetic classification dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# define dataset
x, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5,
                           random_state=1)

# define model evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# define the model
model = DecisionTreeClassifier()
# evaluate model
scores = cross_val_score(model, x, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report performance
print("Mean Accuracy : %.3f (%.3f)"%(scores.mean(), scores.std()))

Mean Accuracy : 0.823 (0.042)


In [8]:
# Running the example creates the dataset, then estimates
# the performance of the model on the problem using the chosen test harness.

## Sensitivity Analysis of Dataset Size
**The baseline result raises questions such as:**<br>

Will the model perform better on more data? <br>
Does the estimated performance hold on smaller or larger samples from the problem domain? <br>
How sensitive is model performance to dataset size? <br>
What is the relationship of dataset size to model performance? <br>

In [9]:
# load dataset
def load_dataset(n_samples):
    # define the dataset
    x, y = make_classification(n_samples=int(n_samples), n_features=20, n_informative=15,
                               n_redundant=5, random_state=1)
    return x, y
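
As a quick sanity check, the helper can be called with a few sizes to confirm that the returned arrays have the requested number of rows. This is a minimal sketch: the sizes 50, 500, and 1e3 are arbitrary illustration values, and the `shapes` dictionary is a hypothetical name used only to collect the results.

```python
from sklearn.datasets import make_classification

# same helper as above, repeated so the snippet runs on its own
def load_dataset(n_samples):
    x, y = make_classification(n_samples=int(n_samples), n_features=20, n_informative=15,
                               n_redundant=5, random_state=1)
    return x, y

# sizes chosen only for illustration; 1e3 is a float and is cast by int() in the helper
shapes = {}
for n in [50, 500, 1e3]:
    x, y = load_dataset(n)
    shapes[int(n)] = (x.shape, y.shape)
    print(int(n), x.shape, y.shape)
```

The `int()` cast inside the helper is what allows sizes written in scientific notation, such as `1e3`, to be passed directly.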

In [10]:
# Next, we define a function that evaluates a model on a loaded dataset.
# It takes the input and output elements of a dataset and returns the mean and
# standard deviation of the decision tree model's accuracy on that dataset.

# evaluate a model
def evaluate_model(x, y):
    # define model evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define model
    model = DecisionTreeClassifier()

    # evaluate model
    scores = cross_val_score(model, x, y, scoring="accuracy", cv=cv, n_jobs=-1)
    # return summary stats
    return [scores.mean(), scores.std()]

In [None]:
# We can define a range of different dataset sizes to evaluate.
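
One way this step might look is sketched below. The list of sizes (50 to 5,000) is an assumption chosen for illustration, not a value taken from the tutorial, and `n_jobs` is left at its default so the snippet stays portable. The loop loads a dataset of each size, evaluates the decision tree on it, and collects the mean and standard deviation of the accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# helpers repeated from the cells above so the snippet runs on its own
def load_dataset(n_samples):
    x, y = make_classification(n_samples=int(n_samples), n_features=20, n_informative=15,
                               n_redundant=5, random_state=1)
    return x, y

def evaluate_model(x, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    model = DecisionTreeClassifier()
    # n_jobs is left at its default here (assumption made for portability)
    scores = cross_val_score(model, x, y, scoring="accuracy", cv=cv)
    return [scores.mean(), scores.std()]

# dataset sizes to evaluate (illustrative values, not from the tutorial)
sizes = [50, 100, 500, 1000, 5000]

means, stds = [], []
for n_samples in sizes:
    # create a dataset of the requested size and evaluate the model on it
    x, y = load_dataset(n_samples)
    mean, std = evaluate_model(x, y)
    means.append(mean)
    stds.append(std)
    print("> %d: %.3f (%.3f)" % (n_samples, mean, std))
```

Because `DecisionTreeClassifier()` is not seeded, the exact numbers will vary from run to run. The collected `means` and `stds` could then be plotted against `sizes`, for instance with a log-scaled x-axis, to inspect how performance and its spread change with dataset size.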