### Generating data for statistic testing

Over the next few weeks we will continue to learn to apply new statistical tests using tools like machine-learning, maximum likelihood, linear regression, and Bayesian statistics. In most of these, we will have a clear  *generative model* defined that we are trying to fit. In others, the generative model may be more abstract, such as when analyzing a set of images, but we still know that the images can be represented as a certain type of data that can be generated. It is important to think deeply about the structure of your data and how your analytical tools are extracting information from it. The easiest way to do this, both as a learning tool, and to validate for yourself and others that your methods are working as intended, is to generate simulated data to run on your analyses as a test case. 

In [21]:
import numpy as np
import pandas as pd
import toyplot


### An example
Let's say that we are trying to fit a linear model that has an intercept and a slope. We can use the numpy random library to easily generate a large set of data that is genarated from a defined intercept and slope which we can then use in our model. We expect the model should return the pre-defined intercept and slope that we used to create the data. If they match, then we have validated that the method is working how we expect it to. We may want to test several generated data sets to be sure. 

Let's use an example linear regression where the intercept is a variable alpha ($\alpha$), the slope is a variable beta ($\beta$), there is a component of noise in the data ($\sigma$). 

$ y = \beta X + \alpha + \sigma $

In [22]:
# the true values we will use
true_beta = 4.5
true_alpha = -1.5
noise = 2.0

# generate a bunch of observations
nsize = 1000
xgen = np.random.normal(0, 2, nsize)
sgen = np.random.normal(0, noise, nsize)
ygen = true_beta * xgen + true_alpha + sgen

#### 1. Examine the data table

In [23]:
# dataframes may be used just to view the data nicely
data = pd.DataFrame({
    "X": xgen,
    "y": ygen,
})
data.head()

Unnamed: 0,X,y
0,-0.332839,-4.626485
1,0.171598,-0.928649
2,1.52843,8.874191
3,5.302041,22.573868
4,-2.273653,-14.132021


#### 2. Examine the data using plotting

In [24]:
# plot the data
toyplot.scatterplot(xgen, ygen, height=250, width=300, size=3);

### Applying machine learning 

The general workflow for scikit-learn involves 4 things: 
1.  prepare the data
2. initialize a Model class instance
3. fit training data with the model
4. predict test data with the model

#### 1. Preparing the data
The features must be a (nsamples, nfeatures) array, and the labels should be a (nsamples, 1) array. You should also split the data set into a training set and a test set. The training data is what the model will be optimized to make predictions for and the test set is used to validate how well it performs. 

In [26]:
# how many samples to hold back for testing
tsize = 200

In [7]:
# convert to a 2d array
X = data.X.values[:, None]

# separate test from training
X_test = X[:tsize]
X_train = X[tsize:]

# show
print(X.shape)
print(X[:5])

(1000, 1)
[[ 0.89915502]
 [-2.64436506]
 [-0.64133517]
 [-0.83443503]
 [-2.26733829]]


In [8]:
# convert to a 1d array
y = data.y.values

# separate test from training
y_test = y[:tsize]
y_train = y[tsize:]

# show
print(y.shape)
print(y[:5])

(1000,)
[ -0.85481375 -15.50599853  -3.40335884  -2.16470431 -12.79559331]


#### 2. initialize a model instance

In [37]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

#### 3. fit the model

In [38]:
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

#### 4. predict y for new data X

In [40]:
# get predicted 'y' values for held back data X_test
yfit = model.predict(X_test)


### Assess the goodness of fit (score)
This is the typical way to assess how well a machine learning algorithm has performed, and is called validation. We are asking how close the predicted values of y are to the known values of y, which we know because we held back these data from the analyses (the test data set). Below is the r2 value and mean squared error. 

In [18]:
# compute r2 from comparing predicted y to actual y
from sklearn.metrics import r2_score, mean_squared_error

results ={
    "R2": r2_score(yfit, y_test),
    "MSE": mean_squared_error(yfit, y_test),
}
print(results)

{'R2': 0.94803145259318755, 'MSE': 4.2451203065517227}


### Assess that we've run the model correctly

Here, because we simulated the data, we can perform another type of validation, which is to ask whether the parameters inferred by the model match to those that we used in our generative model to simulate the data. If so, then we can have improved confidence that we have chosen an appropriate model. The practice of creating simulated data to apply to a model is a useful exercise in trying to understand how the model analyzes the features of the data. Here we generated data with a clear linear relationship, and so clearly a LinearRegression model was appropriate. But other models could predict the y values accurately as well, using different approaches that similarly aim to minimize the error on the training data set (X_train) when some function is applied to it compared to the training labels (y_train). In LinearRegression the function to minimize is the squared error, which is the same process employed in any ordinary least squares model. In other models we will see the process can be different, but lead to similar results. 

In [19]:
pd.DataFrame({
    "Beta": [true_beta, model.coef_[0]],
    "alpha": [true_alpha, model.intercept_],
    }, 
    index=["true", "estimated"])

Unnamed: 0,Beta,alpha
true,4.5,-1.5
estimated,4.501052,-1.512259


### Plot model fit

In [41]:
# build canvas
c = toyplot.Canvas(height=300, width=350)
a = c.cartesian()

# add training and test data points
a.scatterplot(X_train[:, 0], y_train, size=4, opacity=0.5);
a.scatterplot(X_test[:, 0], y_test, size=4, opacity=0.5);

# show that fitted line
a.plot(X_test[:, 0], yfit, color='black', style={"stroke-width": 2.5});