In [None]:
import numpy as np
import pandas as pd
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_columns = 20
pd.options.display.max_rows = 20
pd.options.display.max_colwidth = 80
np.set_printoptions(precision=4, suppress=True)

  A common workflow for model development is to use pandas for data loading and
 cleaning before switching over to a modeling library to build the model itself. An
 important part of the model development process is called feature engineering in
 machine learning. This can describe any data transformation or analytics that extract information from a raw dataset that may be useful in a modeling context. The data
 aggregation and GroupBy tools we have explored in this book are used often in a
 feature engineering context.
 While details of “good” feature engineering are out of scope for this book, I will
 show some methods to make switching between data manipulation with pandas and
 modeling as painless as possible.
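
As a small illustration of GroupBy-based feature engineering, the sketch below adds each row's group mean, and its deviation from that mean, as new columns. The frame and the column names (store, sales) are made up for this example.

```python
import pandas as pd

# Hypothetical raw data: one row per transaction
raw = pd.DataFrame({
    'store': ['a', 'a', 'b', 'b', 'b'],
    'sales': [10.0, 14.0, 3.0, 5.0, 7.0]})

# transform('mean') returns one value per row, aligned with the original index
raw['store_mean'] = raw.groupby('store')['sales'].transform('mean')
raw['sales_vs_store'] = raw['sales'] - raw['store_mean']
raw
```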


 The point of contact between pandas and other analysis libraries is usually NumPy
 arrays. To turn a DataFrame into a NumPy array, use the to_numpy method:

In [None]:
data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]})
data
data.columns
data.to_numpy()

array([[ 1.  ,  0.01, -1.5 ],
       [ 2.  , -0.01,  0.  ],
       [ 3.  ,  0.25,  3.6 ],
       [ 4.  , -4.1 ,  1.3 ],
       [ 5.  ,  0.  , -2.  ]])

 To convert back to a DataFrame, as you may recall from earlier chapters, you can pass
 a two-dimensional ndarray with optional column names:

In [None]:
df2 = pd.DataFrame(data.to_numpy(), columns=['one', 'two', 'three'])
df2

   one   two  three
0  1.0  0.01   -1.5
1  2.0 -0.01    0.0
2  3.0  0.25    3.6
3  4.0 -4.10    1.3
4  5.0  0.00   -2.0


The to_numpy method is intended to be used when your data is homogeneous—for
 example, all numeric types. If you have heterogeneous data, the result will be an
 ndarray of Python objects:

In [None]:
df3 = data.copy()
df3['strings'] = ['a', 'b', 'c', 'd', 'e']
df3
df3.to_numpy()

array([[1, 0.01, -1.5, 'a'],
       [2, -0.01, 0.0, 'b'],
       [3, 0.25, 3.6, 'c'],
       [4, -4.1, 1.3, 'd'],
       [5, 0.0, -2.0, 'e']], dtype=object)
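
One way to catch this before handing data to a modeling library is to inspect the result's dtype and keep only the numeric columns. A minimal sketch, rebuilding the small frame so it runs on its own (the name numeric_only is illustrative):

```python
import pandas as pd

df3 = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.],
    'strings': ['a', 'b', 'c', 'd', 'e']})

# Mixed column types force to_numpy to fall back to dtype=object
print(df3.to_numpy().dtype)

# Keep only the numeric columns to get a floating-point array back
numeric_only = df3.select_dtypes(include='number')
print(numeric_only.to_numpy().dtype)
```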

 For some models, you may wish to use only a subset of the columns. I recommend
 using loc indexing with to_numpy:

In [None]:
model_cols = ['x0', 'x1']
data.loc[:, model_cols].to_numpy()

array([[ 1.  ,  0.01],
       [ 2.  , -0.01],
       [ 3.  ,  0.25],
       [ 4.  , -4.1 ],
       [ 5.  ,  0.  ]])

 Some libraries have native support for pandas and do some of this work for you
 automatically: converting to NumPy from DataFrame and attaching model parameter
 names to the columns of output tables or Series. In other cases, you will have to
 perform this “metadata management” manually.

 Suppose we had a nonnumeric column in our example dataset:

In [None]:
data['category'] = pd.Categorical(['a', 'b', 'a', 'a', 'b'],
                                  categories=['a', 'b'])
data

   x0    x1    y category
0   1  0.01 -1.5        a
1   2 -0.01  0.0        b
2   3  0.25  3.6        a
3   4 -4.10  1.3        a
4   5  0.00 -2.0        b


 If we wanted to replace the 'category' column with dummy variables, we create
 dummy variables, drop the 'category' column, and then join the result:

In [None]:
dummies = pd.get_dummies(data.category, prefix='category',
                         dtype=float)
data_with_dummies = data.drop('category', axis=1).join(dummies)
data_with_dummies

   x0    x1    y  category_a  category_b
0   1  0.01 -1.5         1.0         0.0
1   2 -0.01  0.0         0.0         1.0
2   3  0.25  3.6         1.0         0.0
3   4 -4.10  1.3         1.0         0.0
4   5  0.00 -2.0         0.0         1.0


 There are some nuances to fitting certain statistical models with dummy variables.
 It may be simpler and less error-prone to use Patsy (the subject of the next section)
 when you have more than simple numeric columns.
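
One such nuance is collinearity: when the model has an intercept column, a full set of dummy columns sums to that intercept, so the design matrix loses rank. Here is a minimal sketch of the problem and one common remedy, the drop_first option of pandas.get_dummies (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

cat = pd.Series(pd.Categorical(['a', 'b', 'a', 'a', 'b'],
                               categories=['a', 'b']))

# Intercept plus one dummy per level: the dummy columns sum to the intercept
full = pd.get_dummies(cat, prefix='category', dtype=float)
X_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
print(X_full.shape[1], np.linalg.matrix_rank(X_full))

# Dropping the first level removes the redundant column
reduced = pd.get_dummies(cat, prefix='category', dtype=float, drop_first=True)
X_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))
```

The first matrix has three columns but only rank 2; the second has two columns and full rank.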

 Patsy is a Python library for describing statistical models (especially linear models)
 with a string-based “formula syntax,” which is inspired by (but not exactly the same
 as) the formula syntax used by the R and S statistical programming languages. It is
 installed automatically when you install statsmodels.


  Patsy is well supported for specifying linear models in statsmodels, so I will focus
 on some of the main features to help you get up and running. Patsy’s formulas are a
 special string syntax that looks like:

  y ~ x0 + x1

   The syntax a + b does not mean to add a to b, but rather that these are terms in the
 design matrix created for the model. The patsy.dmatrices function takes a formula
 string along with a dataset (which can be a DataFrame or a dictionary of arrays) and
 produces design matrices for a linear model:

In [None]:
data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]})
data
import patsy
y, X = patsy.dmatrices('y ~ x0 + x1', data)

 Now we have:

In [None]:
# y
X

DesignMatrix with shape (5, 3)
  Intercept  x0     x1
          1   1   0.01
          1   2  -0.01
          1   3   0.25
          1   4  -4.10
          1   5   0.00
  Terms:
    'Intercept' (column 0)
    'x0' (column 1)
    'x1' (column 2)

 These Patsy DesignMatrix instances are NumPy ndarrays with additional metadata:

In [None]:
# np.asarray(y)
np.asarray(X)

array([[ 1.  ,  1.  ,  0.01],
       [ 1.  ,  2.  , -0.01],
       [ 1.  ,  3.  ,  0.25],
       [ 1.  ,  4.  , -4.1 ],
       [ 1.  ,  5.  ,  0.  ]])

 You might wonder where the Intercept term came from. This is a convention for
 linear models like ordinary least squares (OLS) regression. You can suppress the
 intercept by adding the term + 0 to the model:

In [None]:
patsy.dmatrices('y ~ x0 + x1 + 0', data)[1]

DesignMatrix with shape (5, 2)
  x0     x1
   1   0.01
   2  -0.01
   3   0.25
   4  -4.10
   5   0.00
  Terms:
    'x0' (column 0)
    'x1' (column 1)

 The Patsy objects can be passed directly into algorithms like numpy.linalg.lstsq,
 which performs an ordinary least squares regression:

In [None]:
coef, resid, _, _ = np.linalg.lstsq(X, y, rcond=None)

In [None]:
coef

array([[ 0.3129],
       [-0.0791],
       [-0.2655]])

 The model metadata is retained in the design_info attribute, so you can reattach the
 model column names to the fitted coefficients to obtain a Series, for example:

In [None]:
coef
coef = pd.Series(coef.squeeze(), index=X.design_info.column_names)
coef

# Data Transformations in Patsy Formulas
 You can mix Python code into your Patsy formulas; when evaluating the formula, the
 library will try to find the functions you use in the enclosing scope:

In [None]:
y, X = patsy.dmatrices('y ~ x0 + np.log(np.abs(x1) + 1)', data)
X

DesignMatrix with shape (5, 3)
  Intercept  x0  np.log(np.abs(x1) + 1)
          1   1                 0.00995
          1   2                 0.00995
          1   3                 0.22314
          1   4                 1.62924
          1   5                 0.00000
  Terms:
    'Intercept' (column 0)
    'x0' (column 1)
    'np.log(np.abs(x1) + 1)' (column 2)

 Some commonly used variable transformations include standardizing (to mean 0 and
 variance 1) and centering (subtracting the mean). Patsy has built-in functions for this
 purpose:

In [None]:
y, X = patsy.dmatrices('y ~ standardize(x0) + center(x1)', data)
X

As part of a modeling process, you may fit a model on one dataset, then evaluate
 the model based on another. This might be a hold-out portion or new data that
 is observed later. When applying transformations like center and standardize, you
 should be careful when using the model to form predictions based on new data.
 These are called stateful transformations, because you must use statistics like the
 mean or standard deviation of the original dataset when transforming a new dataset.

 The patsy.build_design_matrices function can apply transformations to new
 out-of-sample data using the saved information from the original in-sample dataset:

In [None]:
new_data = pd.DataFrame({
    'x0': [6, 7, 8, 9],
    'x1': [3.1, -0.5, 0, 2.3],
    'y': [1, 2, 3, 4]})
new_X = patsy.build_design_matrices([X.design_info], new_data)
new_X
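
To make the idea of a stateful transformation concrete, here is a plain pandas sketch of centering done by hand: the mean is computed once on the original data and then reused for the new data. It illustrates the principle only and is not Patsy's implementation; train_mean is a name chosen for this example.

```python
import pandas as pd

x1_train = pd.Series([0.01, -0.01, 0.25, -4.1, 0.])
x1_new = pd.Series([3.1, -0.5, 0, 2.3])

# The "state": a statistic learned from the original (in-sample) data
train_mean = x1_train.mean()

# Consistent with the fitted model: reuse the in-sample mean for the new data
centered_new = x1_new - train_mean

# Inconsistent: recentering with the new data's own mean shifts the values
recentered_new = x1_new - x1_new.mean()

print(centered_new.to_numpy())
print(recentered_new.to_numpy())
```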

 Because the plus symbol (+) in the context of Patsy formulas does not mean addition,
 when you want to add columns from a dataset by name, you must wrap them in the
 special I function:

In [None]:
y, X = patsy.dmatrices('y ~ I(x0 + x1)', data)
X

 Patsy has several other built-in transforms in the patsy.builtins module. See the
 online documentation for more.

 Categorical data has a special class of transformations, which I explain next.

 Nonnumeric data can be transformed for a model design matrix in many different
 ways. A complete treatment of this topic is outside the scope of this book and would
 be studied best along with a course in statistics.

 When you use nonnumeric terms in a Patsy formula, they are converted to dummy
 variables by default. If there is an intercept, one of the levels will be left out to avoid
 collinearity:

In [None]:
data = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'],
    'key2': [0, 1, 0, 1, 0, 1, 0, 0],
    'v1': [1, 2, 3, 4, 5, 6, 7, 8],
    'v2': [-1, 0, 2.5, -0.5, 4.0, -1.2, 0.2, -1.7]
})
y, X = patsy.dmatrices('v2 ~ key1', data)
X

 If you omit the intercept from the model, then columns for each category value will
 be included in the model design matrix:

In [None]:
y, X = patsy.dmatrices('v2 ~ key1 + 0', data)
X

 Numeric columns can be interpreted as categorical with the C function:

In [None]:
y, X = patsy.dmatrices('v2 ~ C(key2)', data)
X

 When you’re using multiple categorical terms in a model, things can be more complicated,
 as you can include interaction terms of the form key1:key2, which can be
 used, for example, in analysis of variance (ANOVA) models:

In [None]:
data['key2'] = data['key2'].map({0: 'zero', 1: 'one'})
data
y, X = patsy.dmatrices('v2 ~ key1 + key2', data)
X
y, X = patsy.dmatrices('v2 ~ key1 + key2 + key1:key2', data)
X
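
An interaction column between two dummy-coded terms is the elementwise product of the corresponding dummy columns. The sketch below reproduces that by hand with pandas, assuming 'a' and 'one' are treated as the reference levels (which is what I would expect from sorting the level names). The column names here are illustrative and are not Patsy's own labels.

```python
import pandas as pd

key1 = pd.Series(['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'])
key2 = pd.Series([0, 1, 0, 1, 0, 1, 0, 0]).map({0: 'zero', 1: 'one'})

# Indicator (dummy) columns for the non-reference levels
is_b = (key1 == 'b').astype(float)
is_zero = (key2 == 'zero').astype(float)

# The interaction is 1 only where both indicators are 1
design = pd.DataFrame({
    'key1_b': is_b,
    'key2_zero': is_zero,
    'key1_b:key2_zero': is_b * is_zero})
design
```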

 Patsy provides other ways to transform categorical data, including transformations
 for terms with a particular ordering. See the online documentation for more.

# Introduction to statsmodels
 statsmodels is a Python library for fitting many kinds of statistical models, performing
 statistical tests, and data exploration and visualization. statsmodels contains more
 “classical” frequentist statistical methods, while Bayesian methods and machine learning
 models are found in other libraries.
 Some kinds of models found in statsmodels include:

  • Linear models, generalized linear models, and robust linear models
  • Linear mixed effects models
  • Analysis of variance (ANOVA) methods
  • Time series processes and state space models
  • Generalized method of moments

  In the next few pages, we will use a few basic tools in statsmodels and explore how
 to use the modeling interfaces with Patsy formulas and pandas DataFrame objects.

# Estimating Linear Models
  There are several kinds of linear regression models in statsmodels, from the more
 basic (e.g., ordinary least squares) to more complex (e.g., iteratively reweighted least
 squares).

  Linear models in statsmodels have two different main interfaces: array based and
 formula based. These are accessed through these API module imports:

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

 To show how to use these, we generate a linear model from some random data. Run
 the following code in a Jupyter cell:

In [None]:
# To make the example reproducible
rng = np.random.default_rng(seed=12345)
# print(rng)
def dnorm(mean, variance, size=1):
    if isinstance(size, int):
        size = size,
    return mean + np.sqrt(variance) * rng.standard_normal(*size)

N = 100
X = np.c_[dnorm(0, 0.4, size=N),
          dnorm(0, 0.6, size=N),
          dnorm(0, 0.2, size=N)]
eps = dnorm(0, 0.1, size=N)
beta = [0.1, 0.3, 0.5]

y = np.dot(X, beta) + eps
print(y)

[-0.5995 -0.5885  0.1856 -0.0075 -0.0154 -0.4841  0.0301  0.2175  0.0973
  0.2943 -0.5582  0.3928 -0.8872 -0.141  -0.2488 -0.1155  0.4903 -0.5393
  0.01   -0.1218 -0.4065 -0.263   0.2412 -0.0149 -0.8269  0.858  -0.1582
  0.3229 -0.3182 -0.2518  0.012  -0.2769  0.4892  0.0271  0.3262 -0.6701
 -0.4364  0.1988  0.2911  1.2293 -0.1345  0.1162 -0.2833  0.8264  0.6517
  0.3693  0.4606 -0.36   -0.6794 -0.3239  0.2289  0.3339 -0.0289  0.3515
  0.4105  0.0234 -0.0882 -0.4222  0.9503 -0.8432 -0.1774 -0.5828 -0.0479
  0.4998 -0.41   -0.0651 -0.1192 -0.7378  0.1129 -0.5059  0.2002  1.0372
  0.3964  0.3722  0.0822 -0.0632  0.1685 -0.3024  0.1657 -0.1187 -0.4788
  0.1031 -0.2355 -0.9313  0.3353 -0.032  -0.5318 -0.0093  0.3378 -0.3119
 -0.0479  0.3288 -0.1556  0.3523 -0.1236 -0.0679  0.8316  0.0703 -0.3865
 -0.2146]


 Here, I wrote down the “true” model with known parameters beta. In this case, dnorm
 is a helper function for generating normally distributed data with a particular mean
 and variance. So now we have:

In [None]:
X[:5]
y[:5]

array([-0.5995, -0.5885,  0.1856, -0.0075, -0.0154])

 A linear model is generally fitted with an intercept term, as we saw before with Patsy.
 The sm.add_constant function can add an intercept column to an existing matrix:

In [None]:
X_model = sm.add_constant(X)
X_model[:5]

array([[ 1.    , -0.9005, -0.1894, -1.0279],
       [ 1.    ,  0.7993, -1.546 , -0.3274],
       [ 1.    , -0.5507, -0.1203,  0.3294],
       [ 1.    , -0.1639,  0.824 ,  0.2083],
       [ 1.    , -0.0477, -0.2131, -0.0482]])
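
For a plain NumPy array, the effect of sm.add_constant can be sketched without statsmodels by prepending a column of ones. The small array here is made up for the example:

```python
import numpy as np

X_small = np.array([[0.5, -1.0],
                    [1.5, 2.0],
                    [-0.3, 0.7]])

# Prepend a column of ones so the first coefficient acts as the intercept
X_with_const = np.column_stack([np.ones(len(X_small)), X_small])
X_with_const
```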

 The sm.OLS class can fit an ordinary least squares linear regression:

In [None]:
# print(X)
print(y)
model = sm.OLS(y, X)

Because X rather than X_model is passed here, the fitted model has no intercept: y = a*x0 + b*x1 + c*x2

[-0.5995 -0.5885  0.1856 -0.0075 -0.0154 -0.4841  0.0301  0.2175  0.0973
  0.2943 -0.5582  0.3928 -0.8872 -0.141  -0.2488 -0.1155  0.4903 -0.5393
  0.01   -0.1218 -0.4065 -0.263   0.2412 -0.0149 -0.8269  0.858  -0.1582
  0.3229 -0.3182 -0.2518  0.012  -0.2769  0.4892  0.0271  0.3262 -0.6701
 -0.4364  0.1988  0.2911  1.2293 -0.1345  0.1162 -0.2833  0.8264  0.6517
  0.3693  0.4606 -0.36   -0.6794 -0.3239  0.2289  0.3339 -0.0289  0.3515
  0.4105  0.0234 -0.0882 -0.4222  0.9503 -0.8432 -0.1774 -0.5828 -0.0479
  0.4998 -0.41   -0.0651 -0.1192 -0.7378  0.1129 -0.5059  0.2002  1.0372
  0.3964  0.3722  0.0822 -0.0632  0.1685 -0.3024  0.1657 -0.1187 -0.4788
  0.1031 -0.2355 -0.9313  0.3353 -0.032  -0.5318 -0.0093  0.3378 -0.3119
 -0.0479  0.3288 -0.1556  0.3523 -0.1236 -0.0679  0.8316  0.0703 -0.3865
 -0.2146]


The model’s fit method returns a regression results object containing estimated
 model parameters and other diagnostics:

In [None]:
results = model.fit()
results.params

array([0.0668, 0.268 , 0.4505])

 The summary method on results prints a detailed diagnostic report for the
 model:

In [None]:
print(results.summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.469
Model:                            OLS   Adj. R-squared (uncentered):              0.452
Method:                 Least Squares   F-statistic:                              28.51
Date:                Tue, 12 Nov 2024   Prob (F-statistic):                    2.66e-13
Time:                        01:30:00   Log-Likelihood:                         -25.611
No. Observations:                 100   AIC:                                      57.22
Df Residuals:                      97   BIC:                                      65.04
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

The parameter names here have been given the generic names x1, x2, and so on.
 Suppose instead that all of the model parameters are in a DataFrame:

In [None]:
data = pd.DataFrame(X, columns=['col0', 'col1', 'col2'])
data['y'] = y
data[:5]

       col0      col1      col2         y
0 -0.900506 -0.189430 -1.027870 -0.599527
1  0.799252 -1.545984 -0.327397 -0.588454
2 -0.550655 -0.120254  0.329359  0.185634
3 -0.163916  0.824040  0.208275 -0.007477
4 -0.047651 -0.213147 -0.048244 -0.015374


 Now we can use the statsmodels formula API and Patsy formula strings:

In [None]:
results = smf.ols('y ~ col0 + col1 + col2', data=data).fit()
results.params
results.tvalues

Intercept   -0.652501
col0         1.219768
col1         6.312369
col2         6.567428


 Observe how statsmodels has returned results as Series with the DataFrame column
 names attached. We also do not need to use add_constant when using formulas and
 pandas objects.

 Given new out-of-sample data, you can compute predicted values given the estimated
 model parameters:

In [None]:
results.predict(data[:5])

0   -0.592959
1   -0.531160
2    0.058636
3    0.283658
4   -0.102947
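
For a formula-based OLS fit, predict amounts to building the design matrix for the new rows and taking its dot product with the estimated parameters. Here is a sketch on simulated data; the frame, seed, and coefficients are made up for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=0)
df = pd.DataFrame(rng.standard_normal((50, 2)), columns=['col0', 'col1'])
df['y'] = (1.0 + 0.5 * df['col0'] - 0.2 * df['col1']
           + 0.1 * rng.standard_normal(50))

fit = smf.ols('y ~ col0 + col1', data=df).fit()

new = pd.DataFrame({'col0': [0.0, 1.0], 'col1': [2.0, -1.0]})
predicted = fit.predict(new)

# Same computation by hand: intercept plus slopes times the new predictors
p = fit.params
manual = p['Intercept'] + p['col0'] * new['col0'] + p['col1'] * new['col1']
print(np.allclose(predicted, manual))
```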


There are many additional tools for analysis, diagnostics, and visualization of linear
 model results in statsmodels that you can explore. There are also other kinds of linear
 models beyond ordinary least squares.

# Estimating Time Series Processes
 Another class of models in statsmodels is for time series analysis. Among these
 are autoregressive processes, Kalman filtering and other state space models, and
 multivariate autoregressive models.
 Let’s simulate some time series data with an autoregressive structure and noise. Run
 the following in Jupyter:

In [None]:
init_x = 4

values = [init_x, init_x]
N = 1000

b0 = 0.8
b1 = -0.4
noise = dnorm(0, 0.1, N)
for i in range(N):
    new_x = values[-1] * b0 + values[-2] * b1 + noise[i]
    values.append(new_x)

 This data has an AR(2) structure (two lags) with parameters 0.8 and –0.4. When you
 fit an AR model, you may not know the number of lagged terms to include, so you
 can fit the model with some larger number of lags:

In [None]:
from statsmodels.tsa.ar_model import AutoReg
MAXLAGS = 5
model = AutoReg(values, MAXLAGS)
results = model.fit()

 The estimated parameters in the results have the intercept first, and the estimates for
 the first two lags next:

In [None]:
results.params
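
Conceptually, an autoregressive fit of this kind is a regression of each value on its own lags. The sketch below simulates the same AR(2) process and estimates the two lag coefficients with numpy.linalg.lstsq. It is a simplified illustration rather than a description of what AutoReg does internally, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=12345)
b0, b1 = 0.8, -0.4
N = 1000

values = [4.0, 4.0]
noise = np.sqrt(0.1) * rng.standard_normal(N)
for i in range(N):
    values.append(values[-1] * b0 + values[-2] * b1 + noise[i])
series = np.asarray(values)

# Regress x[t] on an intercept, x[t-1], and x[t-2]
target = series[2:]
lags = np.column_stack([np.ones(len(target)), series[1:-1], series[:-2]])
est, *_ = np.linalg.lstsq(lags, target, rcond=None)
print(est)   # intercept, lag-1 coefficient, lag-2 coefficient
```

With this many observations, the lag estimates should land reasonably close to 0.8 and –0.4.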

Deeper details of these models and how to interpret their results are beyond what
 I can cover in this book, but there’s plenty more to discover in the statsmodels
 documentation.

# Introduction to scikit-learn
  scikit-learn is one of the most widely used and trusted general-purpose Python
 machine learning toolkits. It contains a broad selection of standard supervised and
 unsupervised machine learning methods, with tools for model selection and evaluation,
 data transformation, data loading, and model persistence. These models can
 be used for classification, clustering, prediction, and other common tasks.

  pandas integration in scikit-learn has improved significantly in recent years, and by
 the time you are reading this it may have improved even more. I encourage you to
 check out the latest project documentation.
 As an example for this chapter, I use a now-classic dataset from a Kaggle competition
 about passenger survival rates on the Titanic in 1912. We load the training and test
 datasets using pandas:

In [None]:
# how colab open files:
# https://saturncloud.io/blog/how-to-use-google-colab-to-work-with-local-files/
# https://stackoverflow.com/questions/48376580/how-to-read-data-in-google-colab-from-my-google-drive

from google.colab import drive
drive.mount('/content/drive', force_remount=True)
google_drive_path_header = '/content/drive/MyDrive/analytics_programming'

Mounted at /content/drive


In [None]:
train = pd.read_csv(google_drive_path_header+'/datasets/titanic/train.csv')
test = pd.read_csv(google_drive_path_header+'/datasets/titanic/test.csv')
train.head(4)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


Libraries like statsmodels and scikit-learn generally cannot be fed missing data, so we
 look at the columns to see if there are any that contain missing data:

In [None]:
train.isna().sum()
test.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327


 In statistics and machine learning examples like this one, a typical task is to predict
 whether a passenger would survive based on features in the data. A model is fitted on
 a training dataset and then evaluated on an out-of-sample testing dataset.

  I would like to use Age as a predictor, but it has missing data. There are a number of
 ways to do missing data imputation, but I will do a simple one and use the median of
 the training dataset to fill the nulls in both tables:

In [None]:
impute_value = train['Age'].median()
train['Age'] = train['Age'].fillna(impute_value)
test['Age'] = test['Age'].fillna(impute_value)
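
scikit-learn offers the same fit-on-train, apply-to-both pattern through SimpleImputer. A minimal sketch on two tiny made-up frames standing in for the training and test tables:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train_small = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 35.0]})
test_small = pd.DataFrame({'Age': [np.nan, 47.0]})

# fit learns the median from the training data only
imputer = SimpleImputer(strategy='median')
imputer.fit(train_small[['Age']])

# transform reuses that training median for both tables
train_filled = imputer.transform(train_small[['Age']])
test_filled = imputer.transform(test_small[['Age']])
print(imputer.statistics_)
print(test_filled.ravel())
```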

 Now we need to specify our models. I add a column IsFemale as an encoded version
 of the 'Sex' column:

In [None]:
train['IsFemale'] = (train['Sex'] == 'female').astype(int)
test['IsFemale'] = (test['Sex'] == 'female').astype(int)

 Then we decide on some model variables and create NumPy arrays:

In [None]:
predictors = ['Pclass', 'IsFemale', 'Age']

X_train = train[predictors].to_numpy()
X_test = test[predictors].to_numpy()
y_train = train['Survived'].to_numpy()
X_train[:5]
y_train[:5]

array([0, 1, 1, 1, 0])

 I make no claims that this is a good model or that these features are engineered
 properly. We use the LogisticRegression model from scikit-learn and create a
 model instance:

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

 We can fit this model to the training data using the model’s fit method:

In [None]:
model.fit(X_train, y_train)
print(X_train)
print(y_train)

[[ 3.  0. 22.]
 [ 1.  1. 38.]
 [ 3.  1. 26.]
 ...
 [ 3.  1. 28.]
 [ 1.  0. 26.]
 [ 3.  0. 32.]]
[0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 0 1 1 0 0 0 1
 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 1 0 1 1 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0
 1 0 0 0 1 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0
 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1
 0 1 1 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 0
 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1
 1 0 1 0 0 0 0 0 1 1 1 0 1 1 0 1 1 0 0 0 1 0 0 0 1 0 0 1 0 1 1 1 1 0 0 0 0
 0 0 1 1 1 1 0 1 0 1 1 1 0 1 1 1 0 0 0 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 0 0
 0 1 0 0 1 1 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 1 1
 1 0 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0
 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 0 0 1 0 1 0 0 1 0 0 1
 1 1

 Now, we can form predictions for the test dataset using model.predict:

In [None]:
print(X_test)
y_predict = model.predict(X_test)
y_predict[:10]

[[ 3.   0.  34.5]
 [ 3.   1.  47. ]
 [ 2.   0.  62. ]
 ...
 [ 3.   0.  38.5]
 [ 3.   0.  28. ]
 [ 3.   0.  28. ]]


array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])

 If you had the true values for the test dataset, you could compute an accuracy
 percentage or some other error metric:

 (y_true == y_predict).mean()
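
With made-up labels standing in for the unknown test outcomes, the computation looks like this. The y_true array is hypothetical; y_predict repeats the first ten predictions printed above.

```python
import numpy as np

y_true = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 1])      # hypothetical labels
y_predict = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])   # first ten predictions

# Fraction of predictions that match the true labels
accuracy = (y_true == y_predict).mean()
print(accuracy)
```

Here eight of the ten predictions match, giving an accuracy of 0.8.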

  In practice, there are often many additional layers of complexity in model training.
 Many models have parameters that can be tuned, and there are techniques such as
 cross-validation that can be used for parameter tuning to avoid overfitting to the
 training data. This can often yield better predictive performance or robustness on
 new data.

  Cross-validation works by splitting the training data to simulate out-of-sample
 prediction. Based on a model accuracy score like mean squared error, you can perform
 a grid search on model parameters. Some models, like logistic regression, have
 estimator classes with built-in cross-validation. For example, the LogisticRegressionCV
 class can be used with a parameter indicating how fine-grained of a grid search to do
 on the model regularization parameter C:

In [None]:
from sklearn.linear_model import LogisticRegressionCV
model_cv = LogisticRegressionCV(Cs=10)
model_cv.fit(X_train, y_train)

 To do cross-validation by hand, you can use the cross_val_score helper function,
 which handles the data splitting process. For example, to cross-validate our model
 with ten nonoverlapping splits of the training data, we can do:

In [None]:
from sklearn.model_selection import cross_val_score
model = LogisticRegression(C=10)
scores = cross_val_score(model, X_train, y_train, cv=10)
scores

array([0.7889, 0.7978, 0.7416, 0.8315, 0.809 , 0.7865, 0.7865, 0.764 ,
       0.8202, 0.7978])
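
What cross_val_score automates can be sketched with an explicit loop over KFold splits. This example uses synthetic data from make_classification rather than the Titanic arrays, and it uses plain unshuffled folds, whereas cross_val_score stratifies the folds by default when given a classifier and an integer cv.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     random_state=0)

scores = []
for train_idx, test_idx in KFold(n_splits=4).split(X_demo):
    # Fit on three folds, then score accuracy on the held-out fold
    fold_model = LogisticRegression()
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

scores = np.array(scores)
print(scores, scores.mean())
```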

In [None]:
pd.options.display.max_rows = PREVIOUS_MAX_ROWS