In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
import sys
import os
import datetime
import random
import math
import matplotlib.pyplot as plt


* Tf-idf is a simple twist on the bag-of-words approach. The name stands for term frequency-inverse document frequency. Instead of looking at the raw count of each word in each document in a dataset, tf-idf looks at a normalized count, where each word count is divided by the number of documents the word appears in.

* Tf-idf transforms word count features by multiplying each word's counts by a constant that depends only on that word. Hence, it is an example of feature scaling, a concept introduced in Chapter 2.
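As a rough illustration of the two notes above, here is a minimal sketch of the un-logged form of tf-idf on a toy corpus. The four sentences, the helper name `tfidf`, and the extra factor `n_docs` (the total number of documents, a common convention that only rescales every value by the same amount) are assumptions made for this example, not something taken from the Yelp data.

```python
from collections import Counter

# Toy corpus (made up for illustration)
docs = [
    "it is a puppy",
    "it is a cat",
    "it is a kitten",
    "that is a dog and this is a pen",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Bag-of-words: raw count of each word in each document
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents each word appears in
doc_freq = Counter(w for tokens in tokenized for w in set(tokens))

def tfidf(word, doc_index):
    """Raw count scaled by n_docs / doc_freq (hypothetical helper, un-logged form)."""
    return bow[doc_index][word] * n_docs / doc_freq[word]

for word in ["is", "it", "puppy"]:
    print(word, "scale factor:", n_docs / doc_freq[word])
```

The scale factor depends only on the word, not on the document, which is why tf-idf can be read as multiplying each word's feature column by a constant.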

In [4]:
import json
import pandas as pd

# Load Yelp business data
with open('yelp_academic_dataset_business.json') as biz_f:
    biz_df = pd.DataFrame([json.loads(line) for line in biz_f])

# Load Yelp review data
with open('yelp_academic_dataset_review.json') as review_file:
    review_df = pd.DataFrame([json.loads(line) for line in review_file])

# Pull out only Nightlife and Restaurants businesses
two_biz = biz_df[biz_df.apply(lambda x: 'Nightlife' in x['categories'] or
                                        'Restaurants' in x['categories'],
                                         axis=1)]
# Join with the reviews to get all reviews on the two types of business
twobiz_reviews = two_biz.merge(review_df, on='business_id', how='inner')
# Trim away the features we won't use
twobiz_reviews = twobiz_reviews[['business_id',
    'name',
    'stars_y',
    'text',
    'categories']]
# Create the target column--True for Nightlife businesses, and False otherwise
target = twobiz_reviews.apply(lambda x: 'Nightlife' in x['categories'],
    axis=1)
twobiz_reviews['target'] = target
# Display the boolean Series
target



0         False
1         False
2         False
3         False
4         False
          ...  
166033    False
166034    False
166035    False
166036    False
166037    False
Length: 166038, dtype: bool

In this chapter, we used tf-idf as an entry point into a detailed analysis of how feature transformations can affect the model (or not). Tf-idf is an example of feature scaling, so we contrasted its performance with that of another feature scaling method—l2 normalization.

The results were not as one might have expected. Tf-idf and l2 normalization do not improve the final classifier’s accuracy above plain bag-of-words. After acquiring some statistical modeling and linear algebra chops, we realize why: neither of them changes the column space of the data matrix.
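The column-space claim can be checked numerically on a small example. The sketch below assumes that both transformations act as column scaling with nonzero constants. The matrix `X` and the values in `scales` are made up for illustration.

```python
import numpy as np

# Small made-up data matrix: 4 documents (rows) x 3 words (columns)
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [2.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])

# Arbitrary nonzero per-column scaling constants (assumed values)
scales = np.array([0.5, 4.0, 10.0])
X_scaled = X * scales  # broadcasting multiplies column j by scales[j]

rank_X = np.linalg.matrix_rank(X)
rank_scaled = np.linalg.matrix_rank(X_scaled)
# If stacking the two matrices side by side does not raise the rank,
# the scaled columns lie in the span of the original columns.
rank_stacked = np.linalg.matrix_rank(np.hstack([X, X_scaled]))

same_column_space = rank_X == rank_scaled == rank_stacked
print(rank_X, rank_scaled, rank_stacked, same_column_space)
```

Equal ranks for `X`, `X_scaled`, and the stacked matrix mean the two matrices span the same column space. A linear model fitted on either one can therefore reach the same set of predictions.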

One small difference between the two is that tf-idf can “stretch” the word count as well as “compress” it. In other words, it makes some counts bigger, and others close to zero. Therefore, tf-idf could altogether eliminate uninformative words.
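The stretching and compressing is easiest to see with a logarithmic inverse document frequency. The sketch below assumes the form idf = log(N / df), which the text above does not spell out. The words, counts, and document frequencies are invented for illustration.

```python
import math

n_docs = 10  # assumed corpus size

# Assumed document frequencies: how many of the 10 documents contain each word
doc_freq = {"the": 10, "food": 5, "karaoke": 1}

# Assumed log form of inverse document frequency
idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

# Raw counts of each word in one hypothetical review
counts = {"the": 7, "food": 2, "karaoke": 1}
tfidf = {word: counts[word] * idf[word] for word in counts}

for word in counts:
    print(word, counts[word], round(tfidf[word], 3))
```

Under this assumed form, a word that appears in every document gets a weight of exactly zero, a fairly common word is compressed, and a rare word is stretched. Library implementations may differ: scikit-learn's `TfidfTransformer`, for example, adds 1 to the idf, so its weights never reach zero.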

Along the way, we also discovered another effect of feature scaling: it improves the condition number of the data matrix, making linear models much faster to train. Both l2 normalization and tf-idf have this effect.
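The effect on the condition number can be seen on a tiny matrix whose columns live on very different scales. This is a sketch on one made-up matrix, assuming column-wise l2 normalization. It illustrates the effect and is not a proof that it holds in general.

```python
import numpy as np

# Made-up count matrix: the second column is on a much larger scale than the first
X = np.array([[1.0, 200.0],
              [2.0, 100.0],
              [0.0, 300.0],
              [1.0,   0.0]])

# Column-wise l2 normalization: divide each column by its Euclidean norm
col_norms = np.linalg.norm(X, axis=0)
X_normalized = X / col_norms

# 2-norm condition number: ratio of largest to smallest singular value
cond_before = np.linalg.cond(X)
cond_after = np.linalg.cond(X_normalized)

print("condition number before:", cond_before)
print("condition number after: ", cond_after)
```

In this example the normalized matrix has a much smaller condition number than the original. This is the kind of improvement that tends to help iterative solvers for linear models converge in fewer steps.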

To summarize, the lesson is: the right feature scaling can be helpful for classification. The right scaling accentuates the informative words and downweights the common words. It can also improve the condition number of the data matrix. The right scaling is not necessarily uniform column scaling.

This story is a wonderful illustration of the difficulty of analyzing the effects of feature engineering in the general case. Changing the features affects the training process and the models that ensue. Linear models are the simplest models to understand, yet it still takes very careful experimentation methodology and a lot of deep mathematical knowledge to tease apart the theoretical and practical impacts. This would be mostly impossible with more complicated models or feature transformations.