# Log Transform in Action

Let’s see how the log transform performs for supervised learning. We’ll use both of the previous datasets here. For the Yelp reviews dataset, we’ll use the number of reviews to predict the average rating of a business (see Example 2-8). For the Mashable news articles, we’ll use the number of words in an article to predict its popularity. Since the outputs are continuous numbers, we’ll use simple linear regression as the model. We use scikit-learn to perform 10-fold cross-validation of linear regression on the feature with and without the log transformation.

The models are evaluated by the R-squared score, which measures how well a trained regression model predicts new data. Good models have high R-squared scores. A perfect model gets the maximum score of 1. A model that always predicts the mean of the observed targets scores 0, and a model that does worse than that gets a negative score, which can be arbitrarily low. Cross-validation gives us not only an estimate of the score but also a measure of how much it varies across folds, which helps us gauge whether the difference between the two models is meaningful.
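To make the R-squared score concrete, here is a minimal sketch of the standard definition: one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the toy arrays are illustrative assumptions and have nothing to do with the Yelp or Mashable data. scikit-learn's `LinearRegression.score` computes the same quantity.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot


# Toy targets, made up for illustration only.
y = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y, y)                               # exact predictions
baseline = r_squared(y, np.full_like(y, y.mean()))      # always predict the mean
bad = r_squared(y, y[::-1])                             # predictions in reverse order

print(perfect, baseline, bad)  # 1.0 0.0 -3.0
```

The three cases line up with the description above: a perfect model scores 1, predicting the mean scores 0, and predictions that are worse than the mean push the score below zero.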

Using log-transformed word counts in the Online News Popularity dataset to predict article popularity:

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import cross_val_score

In [12]:
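# Fields in this CSV are separated by a comma followed by a space. A
# multi-character separator needs pandas' Python parsing engine, so we
# request it explicitly to avoid a ParserWarning.
df = pd.read_csv("data/OnlineNewsPopularity.csv", delimiter=", ", engine="python")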

# Take the log transform of the 'n_tokens_content' feature, which
# represents the number of words (tokens) in a news article.
df["log_n_tokens_content"] = np.log10(df["n_tokens_content"] + 1)

# Train two linear regression models to predict the number of shares
# of an article, one using the original feature and the other the
# log transformed version.

m_orig = linear_model.LinearRegression()
scores_orig = cross_val_score(m_orig, df[["n_tokens_content"]], df["shares"], cv=10)

m_log = linear_model.LinearRegression()
scores_log = cross_val_score(m_log, df[["log_n_tokens_content"]], df["shares"], cv=10)

print(
    "R-squared without log transform: %0.5f (+/- %0.5f)"
    % (scores_orig.mean(), scores_orig.std() * 2)
)

print(
    "R-squared with log transform: %0.5f (+/- %0.5f)"
    % (scores_log.mean(), scores_log.std() * 2)
)

R-squared without log transform: -0.00242 (+/- 0.00509)
R-squared with log transform: -0.00114 (+/- 0.00418)
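One rough way to read these two lines is to turn each "mean (+/- spread)" pair into an interval and check whether the intervals overlap. The sketch below does that with the numbers printed above. The helper names `interval` and `overlaps` are hypothetical, and the overlap check is an informal heuristic, not a statistical test.

```python
# Mean R-squared and the printed "+/-" half-width (two standard deviations
# across the 10 folds), copied from the output above.
orig_mean, orig_half = -0.00242, 0.00509
log_mean, log_half = -0.00114, 0.00418


def interval(mean, half_width):
    """Return (low, high) for mean +/- half_width (hypothetical helper)."""
    return (mean - half_width, mean + half_width)


def overlaps(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


orig_iv = interval(orig_mean, orig_half)
log_iv = interval(log_mean, log_half)

print("without log transform: [%0.5f, %0.5f]" % orig_iv)
print("with log transform:    [%0.5f, %0.5f]" % log_iv)
print("intervals overlap:", overlaps(orig_iv, log_iv))
```

Here the two intervals overlap, and both mean scores sit slightly below zero.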
