### Generating data for statistic testing

Over the next few weeks we will continue to learn to apply new statistical tests using tools like machine-learning, maximum likelihood, linear regression, and Bayesian statistics. In most of these, we will have a clear  *generative model* defined that we are trying to fit. In others, the generative model may be more abstract, such as when analyzing a set of images, but we still know that the images can be represented as a certain type of data that can be generated. It is important to think deeply about the structure of your data and how your analytical tools are extracting information from it. The easiest way to do this, both as a learning tool, and to validate for yourself and others that your methods are working as intended, is to generate simulated data to run on your analyses as a test case. 

In [85]:
import numpy as np
import pandas as pd
import toyplot


### An example
Let's say that we are trying to fit a linear model that has an intercept and a slope. We can use the numpy random library to easily generate a large set of data that is genarated from a defined intercept and slope which we can then use in our model. We expect the model should return the pre-defined intercept and slope that we used to create the data. If they match, then we have validated that the method is working how we expect it to. We may want to test several generated data sets to be sure. 

Let's use an example linear regression where the intercept is a variable alpha ($\alpha$), the slope is a variable beta ($\beta$), there is a component of noise in the data ($\sigma$). 

$ y = \beta X + \alpha + \sigma $

In [265]:
# the true values we will use
true_beta = 4.5
true_alpha = -1.5
noise = 2.0

# generate a bunch of observations
nsize = 2000
xgen = np.random.normal(0, 2, nsize)
sgen = np.random.normal(0, noise, nsize)
ygen = true_beta * xgen + true_alpha + sgen

#### 1. Examine the data table

In [266]:
# dataframes may be used just to view the data nicely
data = pd.DataFrame({
    "X": xgen,
    "y": ygen,
})
data.head()

Unnamed: 0,X,y
0,0.65374,1.863593
1,1.829484,8.049884
2,-0.475089,-1.384412
3,-1.224045,-7.939574
4,-0.662908,-4.039893


#### 2. Examine the data using plotting

In [267]:
# plot the data
toyplot.scatterplot(xgen, ygen, height=250, width=300, size=3);

### Applying machine learning 

The general workflow for scikit-learn involves 4 things: 
1.  prepare the data
2. initialize a Model class instance
3. fit training data with the model
4. predict test data with the model

#### 1. Preparing the data
The features must be a (nsamples, nfeatures) array, and the labels should be a (nsamples, 1) array. You should also split the data set into a training set and a test set. The training data is what the model will be optimized to make predictions for and the test set is used to validate how well it performs. 

In [268]:
# how many samples to hold back for testing
tsize = 200

In [269]:
# convert to a 2d array
X = data.X.values.ravel()[:, None]

# separate test from training
X_test = X[:tsize]
X_train = X[tsize:]

# show
print(X.shape)
print(X[:5])

(2000, 1)
[[ 0.6537397 ]
 [ 1.82948391]
 [-0.47508909]
 [-1.22404533]
 [-0.66290781]]


In [270]:
# convert to a 1d array
y = data.y.values

# separate test from training
y_test = y[:tsize]
y_train = y[tsize:]

# show
print(y.shape)
print(y[:5])

(2000,)
[ 1.86359339  8.0498843  -1.38441167 -7.93957383 -4.03989257]


#### 2. initialize a model instance

In [271]:
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)

#### 3. fit the model

In [272]:
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

#### 4. predict y for new data X

In [292]:
# get predicted 'y' values for held back data X_test
yfit = model.predict(X_test)

In [304]:
# compute r2 from comparing predicted y to actual y
from sklearn.metrics import r2_score, mean_squared_error

results ={
    "R2": r2_score(yfit, y_test),
    "MSE": mean_squared_error(yfit, y_test),
}
print(results)

{'MSE': 3.3641599597938945, 'R2': 0.95736018802849909}


### Examine model parameters
Validate how well our model matched to our simulated values

In [291]:
pd.DataFrame({
    "Beta": [true_beta, model.coef_[0]],
    "alpha": [true_alpha, model.intercept_],
    }, 
    index=["true", "estimated"])

Unnamed: 0,Beta,alpha
true,4.5,-1.5
estimated,4.470498,-1.508469


### Plot model fit

In [276]:
# build canvas
c = toyplot.Canvas(height=300, width=350)
a = c.cartesian()

# add training and test data points
a.scatterplot(X_train[:, 0], y_train, size=4, opacity=0.5);
a.scatterplot(X_test[:, 0], y_test, size=4, opacity=0.5);

# show that fitted line
a.plot(X_test[:, 0], yfit, color='black', style={"stroke-width": 2.5});