## Training a Logistic Regression Model Using Cross-Validation

In [1]:
# import libraries
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
# create headers for data
_headers = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'car']

In [3]:
# read in cars dataset
df = pd.read_csv('https://raw.githubusercontent.com/'\
                 'PacktWorkshops/The-Data-Science-Workshop/'\
                 'master/Chapter07/Dataset/car.data', names=_headers, index_col=None)
print(df.shape)
df.head()

(1728, 7)


,buying,maint,doors,persons,lug_boot,safety,car
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [4]:
# encode categorical variables
_df = pd.get_dummies(df, columns=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])
_df.head()

,car,buying_high,buying_low,buying_med,buying_vhigh,maint_high,maint_low,maint_med,maint_vhigh,doors_2,...,doors_5more,persons_2,persons_4,persons_more,lug_boot_big,lug_boot_med,lug_boot_small,safety_high,safety_low,safety_med
0,unacc,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,0,1,0,1,0
1,unacc,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,0,1,0,0,1
2,unacc,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,0,1,1,0,0
3,unacc,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,1,0,0,1,0
4,unacc,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,1,0,0,0,1


In [5]:
# split the data into features and labels
features = _df.drop(['car'], axis=1).values
labels = _df[['car']].values

In [6]:
# import LogisticRegressionCV
from sklearn.linear_model import LogisticRegressionCV
model = LogisticRegressionCV(max_iter=2000, multi_class='auto', cv=5)

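max_iter: You set this to 2000 as an upper limit on the number of solver iterations. The solver stops earlier if it converges, so a generous limit mainly prevents it from giving up before it finds good weights.

multi_class: You set this to auto so that the model picks the multiclass strategy for you. Because the car column has more than two classes, the default solver fits a multinomial model. auto is also the default value, which is why it does not appear when the fitted model is printed.

cv: You set this to 5, which is the number of cross-validation folds. The data is split into five parts, and each candidate regularization strength is trained on four of them and validated on the remaining one.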
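To see what these parameters control, the following is a minimal sketch that fits the same estimator on synthetic data and inspects a few fitted attributes. The synthetic dataset from make_classification and the name demo are assumptions made for illustration; they stand in for the car features and labels. multi_class is left out because auto is the default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in for the car data (assumption): 300 rows, 3 classes
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=0)

# Same settings as the notebook cell above
demo = LogisticRegressionCV(max_iter=2000, cv=5)
demo.fit(X, y)

# Candidate regularization strengths evaluated across the 5 folds
print("candidate C values:", demo.Cs_)
# Regularization strength selected by cross-validation
print("selected C:", demo.C_)
# Iterations actually used; max_iter is only an upper bound
print("most iterations used by any fit:", int(np.max(demo.n_iter_)))
```

If the largest iteration count is well below 2000, the solver converged early and the limit was never reached.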

In [7]:
# fit the model
model.fit(features, labels.ravel())

LogisticRegressionCV(cv=5, max_iter=2000)

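In this step, you train the model by passing in features and labels. Because labels was selected with double brackets, it is a 2D array of shape (1728, 1), while scikit-learn expects a 1D array of targets. You use ravel() to flatten it into a 1D array, or vector.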
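The shape difference can be checked on a small example. This is an illustrative sketch; the three-row DataFrame named toy is hypothetical and simply mimics the car column.

```python
import pandas as pd

# Hypothetical three-row frame mimicking the 'car' label column
toy = pd.DataFrame({'car': ['unacc', 'acc', 'good']})

labels_2d = toy[['car']].values  # double brackets select a DataFrame: 2D
labels_1d = labels_2d.ravel()    # flattened to 1D
direct_1d = toy['car'].values    # single brackets select a Series: already 1D

print(labels_2d.shape, labels_1d.shape, direct_1d.shape)
# (3, 1) (3,) (3,)
```

Selecting the column with single brackets would avoid the need for ravel() altogether.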

In [14]:
# evaluate accuracy on the training data
print(model.score(features, labels.ravel()))

0.9456018518518519


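In this step, you score the model on the same data it was trained on. For a classifier, score() returns the mean accuracy, not R2, so the model classifies about 94.6% of the training rows correctly. Although each cross-validation fit used only 80% of the rows, LogisticRegressionCV by default refits the final model on all of the data once the best regularization strength has been chosen. None of the rows are therefore new to the model, and this figure is likely to be an optimistic estimate of how it performs on unseen data.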
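One way to measure performance on rows the model has never seen is to hold out a test set with train_test_split, which is imported in the first cell but not used. The following is a minimal sketch under stated assumptions: it uses synthetic data from make_classification rather than the car dataset, and the 80/20 split, the names clf, train_acc, test_acc, and manual_acc are choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the car data (assumption): 400 rows, 3 classes
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=0)

# Hold out 20% of the rows; stratify keeps class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Cross-validation now happens only inside the training portion
clf = LogisticRegressionCV(max_iter=2000, cv=5)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)  # accuracy on rows seen during fitting
test_acc = clf.score(X_test, y_test)     # accuracy on held-out rows

# score() on a classifier is the fraction of correct predictions
manual_acc = np.mean(clf.predict(X_test) == y_test)

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The held-out accuracy is the number to report when estimating how the model generalizes.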