# Logistic Regression with Pipelines


In this notebook we reuse the SMS messages data from the Naive Bayes notebook and discuss pipelines a little more. We also introduce a new metric, log_loss, which is computed from predicted probabilities rather than predicted labels and penalizes confident wrong predictions most heavily. In other words, while we want to see a high accuracy, we want to see a low log_loss. To compute log_loss, we call predict_proba() to extract probabilities rather than predict() to extract predictions.
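
To make the log_loss idea concrete, here is a minimal sketch that computes binary log loss by hand and compares it with sklearn. The labels and probabilities are made up for illustration and are not from the SMS data.

```python
import numpy as np
from sklearn.metrics import log_loss

# made-up true labels (1 = spam) and predicted probabilities of spam
y_true = np.array([1, 0, 1, 0])
p_spam = np.array([0.9, 0.1, 0.6, 0.95])   # the last prediction is confidently wrong

# binary log loss: mean of -[y*log(p) + (1-y)*log(1-p)]
per_example = -(y_true * np.log(p_spam) + (1 - y_true) * np.log(1 - p_spam))
manual = per_example.mean()

# for a binary problem, sklearn accepts the positive-class probabilities directly
from_sklearn = log_loss(y_true, p_spam)

print('per-example loss: ', per_example.round(3))
print('manual log loss:  ', manual)
print('sklearn log loss: ', from_sklearn)
```

The confidently wrong prediction contributes by far the largest per-example loss, which is why log_loss can be poor even when accuracy looks good.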



In [3]:
# all the imports for the next code block

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, log_loss


First we do logistic regression without pipelines.

In [4]:
df = pd.read_csv('../data/sms-spam.csv', header=0, usecols=[1,2], encoding='latin-1')

# set up X and y
X = df.text
y = df.spam

# divide into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1234)

# vectorizer
vectorizer = TfidfVectorizer(binary=True)
X_train = vectorizer.fit_transform(X_train)  # fit and transform the train data
X_test = vectorizer.transform(X_test)        # transform only the test data

# train
classifier = LogisticRegression(solver='lbfgs', class_weight='balanced')
classifier.fit(X_train, y_train)

# evaluate
pred = classifier.predict(X_test)
print('accuracy score: ', accuracy_score(y_test, pred))
print('precision score: ', precision_score(y_test, pred))
print('recall score: ', recall_score(y_test, pred))
print('f1 score: ', f1_score(y_test, pred))
probs = classifier.predict_proba(X_test)
print('log loss: ', log_loss(y_test, probs))

accuracy score:  0.984504132231405
precision score:  0.926829268292683
recall score:  0.95
f1 score:  0.9382716049382716
log loss:  0.15218610084693873


### Pipelines

Next we run the algorithm again but use pipelines.

So what can you put in a pipeline? You can use sklearn classes or write your own. The intermediate steps in the pipeline must be *transformers*, that is, they need .fit() and .transform() methods, or a fit_transform() method. The last step is usually an *estimator*, which only needs to implement a .fit() method.
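
As a rough sketch of what "write your own" can look like, the example below defines a small custom transformer and places it ahead of the vectorizer. The class name DigitMasker, the step name 'mask', and the toy messages are all hypothetical and chosen only for illustration.

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class DigitMasker(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: replaces every run of digits with 'NUM'."""

    def fit(self, X, y=None):
        # nothing to learn; return self so the pipeline can chain calls
        return self

    def transform(self, X):
        return [re.sub(r'\d+', 'NUM', text) for text in X]


# tiny made-up corpus (1 = spam, 0 = ham)
texts = ["WIN 1000 now", "call 0800 for FREE cash", "see you at lunch", "are you home yet"]
labels = [1, 1, 0, 0]

demo_pipe = Pipeline([
    ('mask', DigitMasker()),                         # intermediate step: transformer
    ('tfidf', TfidfVectorizer()),                    # intermediate step: transformer
    ('logreg', LogisticRegression(solver='lbfgs')),  # last step: estimator
])
demo_pipe.fit(texts, labels)

masked = demo_pipe.named_steps['mask'].transform(["call 08001234 now"])
print(masked)
print(demo_pipe.predict(["free cash 500"]))
```

Inheriting from TransformerMixin supplies fit_transform() for free, and BaseEstimator supplies get_params() and set_params().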

In the code block below we create a pipeline named **pipe1** using the Pipeline() constructor. Its argument is a list of tuples, where the first item in each tuple is a name you choose for that stage, and the second item is the transformer or estimator object for that stage.

In [7]:
from sklearn.pipeline import Pipeline

# read in data, split raw data into train and test, then use pipeline to transform
df = pd.read_csv('../data/sms-spam.csv', header=0, usecols=[1,2], encoding='latin-1')
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['spam'], test_size=0.2, train_size=0.8, random_state=1234)

pipe1 = Pipeline([
        ('tfidf', TfidfVectorizer(binary=True)),
        ('logreg', LogisticRegression(solver='lbfgs', class_weight='balanced')),
])

pipe1.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=True,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('logreg',
                 LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                                    fit_intercept=True, intercept

In [8]:
pred = pipe1.predict(X_test)
print('accuracy score: ', accuracy_score(y_test, pred))
print('precision score: ', precision_score(y_test, pred))
print('recall score: ', recall_score(y_test, pred))
print('f1 score: ', f1_score(y_test, pred))
probs = pipe1.predict_proba(X_test)
print('log loss: ', log_loss(y_test, probs))

accuracy score:  0.984504132231405
precision score:  0.926829268292683
recall score:  0.95
f1 score:  0.9382716049382716
log loss:  0.15218610084693873


### Observations

The first thing to notice is that there were fewer code steps involved in using the pipeline. The second observation is that the results are the same.

Before moving on to cross-validation, let's look at some information we can get from our pipeline. This will help us identify parameters we may want to tune during cross-validation.

Notice above that sklearn printed out information about the pipeline steps, by name. We can also extract that information ourselves, as shown below.

In [5]:
# inspect the pipeline steps
pipe1.steps

[('tfidf',
  TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                  dtype=<class 'numpy.float64'>, encoding='utf-8',
                  input='content', lowercase=True, max_df=1.0, max_features=None,
                  min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                  smooth_idf=True, stop_words=None, strip_accents=None,
                  sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                  tokenizer=None, use_idf=True, vocabulary=None)),
 ('logreg', LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                     fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                     max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                     random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                     warm_start=False))]

In [6]:
# inspect coefficients of the model
pipe1.named_steps['logreg'].coef_

array([[ 0.68642287,  0.84627621, -0.02188987, ..., -0.01523863,
         0.18401337, -0.99653254]])
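
The raw coefficient array is hard to read on its own. One way to interpret it, sketched below on a made-up corpus, is to pair each coefficient with its term through the vectorizer's vocabulary_ attribute, which maps terms to column indices. The names demo_pipe, texts, and labels are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# tiny made-up corpus (1 = spam, 0 = ham)
texts = ["win free prize now", "free cash call now", "see you at lunch", "are you home yet"]
labels = [1, 1, 0, 0]

demo_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('logreg', LogisticRegression(solver='lbfgs')),
])
demo_pipe.fit(texts, labels)

# vocabulary_ maps each term to its column index in the tf-idf matrix
vocab = demo_pipe.named_steps['tfidf'].vocabulary_
coefs = demo_pipe.named_steps['logreg'].coef_[0]

# invert the mapping so that position i holds the term for column i
terms = np.empty(len(vocab), dtype=object)
for term, idx in vocab.items():
    terms[idx] = term

# sort columns from most negative (ham-like) to most positive (spam-like)
order = np.argsort(coefs)
print('most ham-like terms: ', list(terms[order[:3]]))
print('most spam-like terms:', list(terms[order[-3:]]))
```

Positive coefficients push a message toward the spam class and negative ones toward ham.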

In [7]:
# inspect parameters of the model
pipe1.named_steps['logreg'].get_params()


{'C': 1.0,
 'class_weight': 'balanced',
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'warn',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [8]:
# to check all the parameters of the pipeline, do this:
pipe1.get_params()

{'memory': None,
 'steps': [('tfidf',
   TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.float64'>, encoding='utf-8',
                   input='content', lowercase=True, max_df=1.0, max_features=None,
                   min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                   smooth_idf=True, stop_words=None, strip_accents=None,
                   sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=None, use_idf=True, vocabulary=None)),
  ('logreg', LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                      fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                      max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                      random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                      warm_start=False))],
 'verbose': False,
 'tfidf': TfidfVectorizer(analyzer='word', binary=False, decode_e
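
The full get_params() output is long, so it can help to list only the nested parameter names. The sketch below rebuilds an unfitted pipeline with the same two steps (under the hypothetical name demo_pipe) and filters the keys that contain a double underscore.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

demo_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(binary=True)),
    ('logreg', LogisticRegression(solver='lbfgs', class_weight='balanced')),
])

# get_params() is deep by default, so it includes each step's own parameters
params = demo_pipe.get_params()

# nested parameters are named <step name>__<parameter name>
nested = sorted(name for name in params if '__' in name)
for name in nested:
    print(name, '=', params[name])
```

These step-prefixed names are the same ones that set_params() accepts.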

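We've seen that we can get parameters out of the pipeline, but we can also set them. A parameter of a step is addressed by the step name, two underscores, and the parameter name, for example tfidf__min_df. Below we raise min_df to 3, which ignores terms that appear in fewer than 3 documents, and we raise the C parameter in logistic regression from 1.0 to 2.0. C is the inverse of the regularization strength, so the higher C is, the weaker the regularization.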

We see slightly better results by tweaking these two parameters: accuracy rises from about 0.9845 to 0.9866, and log loss falls from about 0.152 to 0.102. The main point here, though, is to show how the pipeline makes it easy to experiment with parameters.

In [9]:
# set a couple of parameters
pipe1.set_params(tfidf__min_df=3, logreg__C=2.0).fit(X_train, y_train)
pred = pipe1.predict(X_test)
print("accuracy: ", accuracy_score(y_test, pred))
probs = pipe1.predict_proba(X_test)
print("log loss: ", log_loss(y_test, probs))

accuracy:  0.9865702479338843
log loss:  0.10179801938057494
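
To see the effect of C in isolation, the sketch below refits a small pipeline at several values of C and records the size of the coefficient vector. The corpus and the names demo_pipe and norms are made up for illustration; the exact numbers will differ on the SMS data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# tiny made-up corpus (1 = spam, 0 = ham)
texts = ["win free prize now", "free cash call now", "see you at lunch", "are you home yet"]
labels = [1, 1, 0, 0]

demo_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('logreg', LogisticRegression(solver='lbfgs')),
])

norms = {}
for C in [0.1, 1.0, 10.0]:
    # set_params returns the pipeline itself, so fit can be chained
    demo_pipe.set_params(logreg__C=C).fit(texts, labels)
    # L2 norm of the fitted coefficient vector
    norms[C] = np.linalg.norm(demo_pipe.named_steps['logreg'].coef_)

for C, norm in norms.items():
    print('C =', C, ' coefficient norm =', round(float(norm), 3))
```

With weaker regularization (larger C), the model is free to use larger coefficients, so the norm grows as C increases.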


The syntax of sklearn pipelines takes a little getting used to, but it tends to reduce the amount of code you have to read through and, as we have seen, makes experimenting a little easier.