In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import logging

logging.getLogger('jieba').setLevel(logging.WARN)
logging.getLogger('fgclassifier').setLevel(logging.INFO)

In [2]:
import os

os.chdir('..')

# Model Selection

## Baseline Models

This notebook shows how to use our baseline model.
It also demonstrates how to test different feature models (i.e.,
different ways of building the features) at the same time.

We will mostly use the Google-translated English dataset for this
demonstration.

In [4]:
import config
from collections import defaultdict
from sklearn.model_selection import train_test_split

from fgclassifier.utils import read_data, get_dataset

X_train, y_train = read_data(get_dataset('train'))
X_test, y_test = read_data(get_dataset('valid'))

In [5]:
# Cache feature models and transformed features, so that different
# steps can reuse them instead of recomputing
fm = defaultdict(dict)
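
As a rough illustration of how such a cache can be used, the sketch below memoizes a transformation under a per-name slot with `'model'` and `'train'` keys, mirroring the keys read from `fm` later in this notebook. The `CachedStep` class and `square` function are hypothetical stand-ins, not the actual `FeaturePipeline` implementation.

```python
from collections import defaultdict

import numpy as np


class CachedStep:
    """Hypothetical cached feature step; not the fgclassifier API."""

    def __init__(self, name, transformer, cache):
        self.name = name
        self.transformer = transformer
        self.cache = cache  # shared dict of dicts, e.g. defaultdict(dict)

    def fit_transform(self, X):
        slot = self.cache[self.name]
        if 'train' not in slot:
            # First call: compute and remember both the model and its output
            slot['model'] = self.transformer
            slot['train'] = self.transformer(X)
        return slot['train']


calls = []


def square(X):
    # Record each real computation so cache hits are observable
    calls.append(1)
    return np.asarray(X) ** 2


fm_demo = defaultdict(dict)
step = CachedStep('square', square, fm_demo)
first = step.fit_transform([1, 2, 3])
second = step.fit_transform([1, 2, 3])  # served from the cache
print(len(calls), first is second)
```

Because the cache is a plain dictionary keyed by feature-model name, deleting an entry (as in the commented-out `del fm[...]` lines) is enough to force a rebuild.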

In [14]:
# del fm['tfidf_sv']
# del fm['tfidf_sv_dense']
# del fm['lsa_200_sv']
# del fm['lsa_500_sv']

In [16]:
from fgclassifier.features import FeaturePipeline, logger

for name in ['count', 'tfidf',
             'lsa_200', 'lsa_500', 'lsa_1k',
             'count_sv', 'tfidf_sv', 'tfidf_sv_dense',
             'lsa_200_sv', 'lsa_500_sv']:
    logger.info(f'Building features for {name}...')
    model = FeaturePipeline.from_spec(name, cache=fm)
    model.fit_transform(X_train)
    model.transform(X_test)

2018-12-03 16:31:47,205 [INFO] Building features for count...
2018-12-03 16:31:47,206 [INFO]   count: fit_transform use cache.
2018-12-03 16:31:47,210 [INFO]   count: transform use cache.
2018-12-03 16:31:47,222 [INFO] Building features for tfidf...
2018-12-03 16:31:47,223 [INFO]   tfidf: fit_transform use cache.
2018-12-03 16:31:47,226 [INFO]   tfidf: transform use cache.
2018-12-03 16:31:47,230 [INFO] Building features for lsa_200...
2018-12-03 16:31:47,232 [INFO]   lsa_200: fit_transform use cache.
2018-12-03 16:31:47,235 [INFO]   lsa_200: transform use cache.
2018-12-03 16:31:47,237 [INFO] Building features for lsa_500...
2018-12-03 16:31:47,238 [INFO]   lsa_500: fit_transform use cache.
2018-12-03 16:31:47,240 [INFO]   lsa_500: transform use cache.
2018-12-03 16:31:47,241 [INFO] Building features for lsa_1k...
2018-12-03 16:31:47,243 [INFO]   lsa_1k: fit_transform use cache.
2018-12-03 16:31:47,248 [INFO]   lsa_1k: transform use cache.
2018-12-03 16:31:47,249 [INFO] Building featu

Examine the quality of the top terms:

In [22]:
from collections import Counter

print('Data Shape:', X_train.shape, X_test.shape)

for mn in ['count', 'count_sv']:
    model = fm[mn]['model'].named_steps[mn]
    x_train = fm[mn]['train']
    counts = np.sum(x_train, axis=0).flat
    counts = {k: counts[v] for k, v in model.vocabulary_.items()}
    print('\nmin_df: %.3f, max_df: %.3f, ngram_range: %s' % (
        model.min_df, model.max_df, model.ngram_range
    ))
    print('\nvocab size: %s\n' % len(model.vocabulary_))
    print('\n'.join([
        '%s \t %s' % (k, v)
        for k, v in Counter(counts).most_common()[:10]]))

Data Shape: (105000,) (15000,)

min_df: 0.005, max_df: 0.990, ngram_range: (1, 5)

vocab size: 2959

味道 	 124409
不错 	 120832
感觉 	 91453
可以 	 90883
好吃 	 83884
还是 	 83418
没有 	 75014
比较 	 73069
我们 	 67480
就是 	 66943

min_df: 0.010, max_df: 0.990, ngram_range: (1, 5)

vocab size: 1486

味道 	 124409
不错 	 120832
感觉 	 91453
可以 	 90883
好吃 	 83884
还是 	 83418
没有 	 75014
比较 	 73069
我们 	 67480
就是 	 66943
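
The two vocabularies above differ only in `min_df`, which roughly halves the vocabulary while leaving the most frequent terms untouched. A toy sketch of the same effect with scikit-learn's `CountVectorizer` follows; the four documents are made up for illustration and the thresholds are not the ones used by the `count` and `count_sv` specs.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "taste" appears in 4 docs, "good" in 3, "service" in 2,
# every other term in exactly 1 document.
docs = [
    "taste good service good",
    "taste good price high",
    "taste fine parking hard",
    "service slow taste good",
]

# min_df=1 keeps every term; min_df=0.5 keeps terms found in >= 50% of docs
loose = CountVectorizer(min_df=1).fit(docs)
strict = CountVectorizer(min_df=0.5).fit(docs)

print(len(loose.vocabulary_), len(strict.vocabulary_))
print(sorted(strict.vocabulary_))
```

Raising `min_df` only drops rare terms, which is why the top-10 lists for both feature models are identical.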


In [24]:
fm['tfidf']['model'].named_steps

{'count': FeaturePipeline(steps=count),
 'tfidf': Tfidf(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)}
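
For readers without access to `fgclassifier`, the chained steps can be approximated in plain scikit-learn. The sketch below assumes that an `lsa_*` spec corresponds to counts, then TF-IDF, then truncated SVD (the conclusion describes LSA as Tfidf + SVD); the step names, toy documents, and `n_components=2` are illustrative choices, not the project's settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Made-up documents, only for demonstrating the shapes involved
docs = [
    "the noodles taste great and the price is fair",
    "service was slow but the waiters were friendly",
    "parking is hard to find near the restaurant",
    "great taste and friendly service",
    "the price is high and the portion is small",
]

lsa = Pipeline([
    ('count', CountVectorizer()),      # raw term counts (sparse)
    ('tfidf', TfidfTransformer()),     # reweight counts by inverse doc frequency
    ('svd', TruncatedSVD(n_components=2, random_state=0)),  # dense low-rank projection
])

X_lsa = lsa.fit_transform(docs)
print(X_lsa.shape)
```

The output is a dense matrix with one row per document and one column per SVD component, which is the form that LDA and the linear classifiers consume later on.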

## The Very Basic TF-IDF + LDA classifier

In [25]:
# Import all feature models at once, to avoid
# classes being reloaded and causing save_model to fail
from fgclassifier.baseline import Baseline, Dummy
from fgclassifier.classifiers import LDA
from fgclassifier.train import fm_cross_check

In [26]:
# Linear Discriminant Analysis, specify the FeaturePipeline
# as steps
model = Baseline(('LDA', LDA), fm=fm['lsa_200']['model'])

# Always pass in the original features
# the pipeline will take care of the cache
model.fit(X_train, y_train)
print(model.name)
print('Final score:', model.score(X_test, y_test))

2018-12-03 16:35:40,836 [INFO]   lsa_200: fit_transform use cache.
2018-12-03 16:38:10,370 [INFO]   lsa_200: transform use cache.


lsa_200_LDA


2018-12-03 16:38:10,648 [INFO] [Validate]: F1 Scores
2018-12-03 16:38:10,664 [INFO]   location_traffic_convenience            	0.4216
2018-12-03 16:38:10,674 [INFO]   location_distance_from_business_district	0.3212
2018-12-03 16:38:10,680 [INFO]   location_easy_to_find                   	0.3737
2018-12-03 16:38:10,685 [INFO]   service_wait_time                       	0.4443
2018-12-03 16:38:10,693 [INFO]   service_waiters_attitude                	0.5768
2018-12-03 16:38:10,698 [INFO]   service_parking_convenience             	0.4081
2018-12-03 16:38:10,703 [INFO]   service_serving_speed                   	0.4231
2018-12-03 16:38:10,708 [INFO]   price_level                             	0.4622
2018-12-03 16:38:10,713 [INFO]   price_cost_effective                    	0.4542
2018-12-03 16:38:10,718 [INFO]   price_discount                          	0.5256
2018-12-03 16:38:10,724 [INFO]   environment_decoration                  	0.4345
2018-12-0

Final score: 0.43588017714066274


In [27]:
model.scores(X_test, y_test)

2018-12-03 16:38:10,907 [INFO]   lsa_200: transform use cache.
2018-12-03 16:38:11,154 [INFO] [Validate]: F1 Scores
2018-12-03 16:38:11,160 [INFO]   location_traffic_convenience            	0.4216
2018-12-03 16:38:11,166 [INFO]   location_distance_from_business_district	0.3212
2018-12-03 16:38:11,171 [INFO]   location_easy_to_find                   	0.3737
2018-12-03 16:38:11,176 [INFO]   service_wait_time                       	0.4443
2018-12-03 16:38:11,183 [INFO]   service_waiters_attitude                	0.5768
2018-12-03 16:38:11,188 [INFO]   service_parking_convenience             	0.4081
2018-12-03 16:38:11,194 [INFO]   service_serving_speed                   	0.4231
2018-12-03 16:38:11,200 [INFO]   price_level                             	0.4622
2018-12-03 16:38:11,205 [INFO]   price_cost_effective                    	0.4542
2018-12-03 16:38:11,211 [INFO]   price_discount                          	0.5256
2018-12-03 16:38:11,221 [IN

[0.4216034504833237,
 0.32121982251231634,
 0.3736779422878393,
 0.44433567028770693,
 0.5767993817019281,
 0.4080903682809518,
 0.4230675551515519,
 0.4622230622791564,
 0.4542291267911739,
 0.525597439225672,
 0.43450171068241394,
 0.4133843136767832,
 0.3996748693028679,
 0.4768842811889512,
 0.36269423722351524,
 0.5286573221305858,
 0.3335855129187495,
 0.38940573775982806,
 0.4841382505503321,
 0.4838334883776081]
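
The final score reported by `model.score` appears to be the unweighted mean of these per-label scores. This can be checked directly from the printed values:

```python
# Per-label scores copied from the output of model.scores(X_test, y_test)
label_scores = [
    0.4216034504833237, 0.32121982251231634, 0.3736779422878393,
    0.44433567028770693, 0.5767993817019281, 0.4080903682809518,
    0.4230675551515519, 0.4622230622791564, 0.4542291267911739,
    0.525597439225672, 0.43450171068241394, 0.4133843136767832,
    0.3996748693028679, 0.4768842811889512, 0.36269423722351524,
    0.5286573221305858, 0.3335855129187495, 0.38940573775982806,
    0.4841382505503321, 0.4838334883776081,
]

# Unweighted mean over the 20 labels
final_score = sum(label_scores) / len(label_scores)
print(round(final_score, 4))  # 0.4359
```

Since every label carries equal weight, a weak label such as `location_distance_from_business_district` pulls the final score down as much as a strong one lifts it.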

## Search for the Best Feature + Classifier Combination

In [28]:
# Run for all classifiers and feature builders
all_avg_scores, all_scores = defaultdict(dict), defaultdict(dict)

In [29]:
from fgclassifier import classifiers
from fgclassifier.baseline import Dummy

Dummy(classifiers.DummyStratified)

Dummy(classifier=None)

In [None]:
from IPython.display import clear_output

conf = {
    'fm_cache': fm,
    'X_train': X_train,
    'X_test': X_test,
    'y_train': y_train,
    'y_test': y_test,
    'results': {
        'models': {},
        'avg': all_avg_scores,
        'all': all_scores
    }
}

# We'd only need to run the dummy models on one feature model,
# as they do not care about the features
fm_cross_check(
    ['tfidf_sv'],
    ['DummyStratified', 'DummyMostFrequent'],
    model_cls=Dummy, **conf)

# Naive Bayes models cannot handle negative values, so we pass
# in only tfidf features
fm_cross_check(
    ['tfidf', 'tfidf_sv'],
    ['MultinomialNB', 'ComplementNB'], **conf)

# All other classifiers can run on every remaining feature model
results = fm_cross_check(
    ['lsa_200',
     'lsa_500',
     'lsa_1k',
     'tfidf_sv_dense',
     'lsa_200_sv',
     'lsa_500_sv',
    ],
    ['LDA', 'LinearSVC', 'Logistic', 'Ridge'], **conf)

clear_output()

2018-12-03 16:50:07,937 [INFO] 
2018-12-03 16:50:07,952 [INFO] 
2018-12-03 16:50:07,954 [INFO] Train for tfidf_sv -> DummyStratified...
2018-12-03 16:50:09,101 [INFO] [Validate]: F1 Scores
2018-12-03 16:50:09,117 [INFO]   location_traffic_convenience            	0.2524
2018-12-03 16:50:09,126 [INFO]   location_distance_from_business_district	0.2579
2018-12-03 16:50:09,145 [INFO]   location_easy_to_find                   	0.2505
2018-12-03 16:50:09,159 [INFO]   service_wait_time                       	0.2522
2018-12-03 16:50:09,169 [INFO]   service_waiters_attitude                	0.2535
2018-12-03 16:50:09,179 [INFO]   service_parking_convenience             	0.2503
2018-12-03 16:50:09,190 [INFO]   service_serving_speed                   	0.2465
2018-12-03 16:50:09,198 [INFO]   price_level                             	0.2458
2018-12-03 16:50:09,225 [INFO]   price_cost_effective                    	0.2563
2018-12-03 16:50:09,233 [INFO]   price_discount                          	0.2490
2

2018-12-03 16:50:14,664 [INFO]   environment_noise                       	0.4714
2018-12-03 16:50:14,670 [INFO]   environment_space                       	0.4599
2018-12-03 16:50:14,675 [INFO]   environment_cleaness                    	0.4898
2018-12-03 16:50:14,682 [INFO]   dish_portion                            	0.4125
2018-12-03 16:50:14,687 [INFO]   dish_taste                              	0.4958
2018-12-03 16:50:14,693 [INFO]   dish_look                               	0.3629
2018-12-03 16:50:14,699 [INFO]   dish_recommendation                     	0.3217
2018-12-03 16:50:14,705 [INFO]   others_overall_experience               	0.4576
2018-12-03 16:50:14,710 [INFO]   others_willing_to_consume_again         	0.4379
2018-12-03 16:50:14,711 [INFO] ---------------------------------------------------
2018-12-03 16:50:14,712 [INFO] 【tfidf -> ComplementNB】: 0.4223
2018-12-03 16:50:14,714 [INFO] ---------------------------------------------------
2018-12-03 16:50:14,715 [INFO] 
2018-12-03

2018-12-03 16:52:29,797 [INFO] ---------------------------------------------------
2018-12-03 16:52:29,803 [INFO] Train for lsa_200 -> LinearSVC...
2018-12-03 16:52:29,808 [INFO]   lsa_200: fit_transform use cache.
2018-12-03 16:59:03,925 [INFO]   lsa_200: transform use cache.
2018-12-03 16:59:04,081 [INFO] [Validate]: F1 Scores
2018-12-03 16:59:04,096 [INFO]   location_traffic_convenience            	0.4250
2018-12-03 16:59:04,103 [INFO]   location_distance_from_business_district	0.3062
2018-12-03 16:59:04,110 [INFO]   location_easy_to_find                   	0.3590
2018-12-03 16:59:04,114 [INFO]   service_wait_time                       	0.3420
2018-12-03 16:59:04,119 [INFO]   service_waiters_attitude                	0.5445
2018-12-03 16:59:04,123 [INFO]   service_parking_convenience             	0.3114
2018-12-03 16:59:04,132 [INFO]   service_serving_speed                   	0.3820
2018-12-03 16:59:04,136 [INFO]   price_level           

2018-12-03 17:38:29,382 [INFO]   service_serving_speed                   	0.4650
2018-12-03 17:38:29,387 [INFO]   price_level                             	0.5152
2018-12-03 17:38:29,391 [INFO]   price_cost_effective                    	0.5229
2018-12-03 17:38:29,398 [INFO]   price_discount                          	0.5452
2018-12-03 17:38:29,414 [INFO]   environment_decoration                  	0.4927
2018-12-03 17:38:29,422 [INFO]   environment_noise                       	0.4709
2018-12-03 17:38:29,429 [INFO]   environment_space                       	0.4612
2018-12-03 17:38:29,438 [INFO]   environment_cleaness                    	0.5288
2018-12-03 17:38:29,444 [INFO]   dish_portion                            	0.4190
2018-12-03 17:38:29,449 [INFO]   dish_taste                              	0.5475
2018-12-03 17:38:29,454 [INFO]   dish_look                               	0.3684
2018-12-03 17:38:29,458 [INFO]   dish_recommendation                     	0.4078
2018-12-03 17:38:29,462 [INF

2018-12-03 18:09:37,915 [INFO] 【lsa_500 -> Ridge】: 0.3961
2018-12-03 18:09:37,916 [INFO] ---------------------------------------------------
2018-12-03 18:09:37,917 [INFO] 
2018-12-03 18:09:37,919 [INFO] 
2018-12-03 18:09:37,920 [INFO] Train for lsa_1k -> LDA...
2018-12-03 18:09:37,923 [INFO]   lsa_1k: fit_transform use cache.
2018-12-03 18:23:56,847 [INFO]   lsa_1k: transform use cache.
2018-12-03 18:23:57,989 [INFO] [Validate]: F1 Scores
2018-12-03 18:23:58,003 [INFO]   location_traffic_convenience            	0.4545
2018-12-03 18:23:58,009 [INFO]   location_distance_from_business_district	0.3669
2018-12-03 18:23:58,017 [INFO]   location_easy_to_find                   	0.5094
2018-12-03 18:23:58,023 [INFO]   service_wait_time                       	0.5062
2018-12-03 18:23:58,030 [INFO]   service_waiters_attitude                	0.6195
2018-12-03 18:23:58,037 [INFO]   service_parking_convenience             	0.5359
2018-12-03 18:23:58,042 [INFO]   service_serving_speed                
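
The cross-check above is essentially a nested loop over feature models and classifiers that records one score per combination. The sketch below shows that idea on synthetic single-label data with plain scikit-learn. It is not the `fm_cross_check` implementation: the feature builders, classifiers, and the use of macro-averaged F1 are all assumptions made for illustration.

```python
from collections import defaultdict

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the review data (one label instead of 20)
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Factories so that each combination gets a fresh, unfitted estimator
feature_builders = {
    'scaled': lambda: StandardScaler(),
    'pca_5': lambda: PCA(n_components=5, random_state=0),
}
classifier_makers = {
    'Logistic': lambda: LogisticRegression(max_iter=1000),
    'Ridge': lambda: RidgeClassifier(),
}

avg_scores = defaultdict(dict)
for fm_name, make_fm in feature_builders.items():
    # Build features once per feature model, then reuse for every classifier
    fm_model = make_fm()
    Z_tr = fm_model.fit_transform(X_tr)
    Z_te = fm_model.transform(X_te)
    for clf_name, make_clf in classifier_makers.items():
        clf = make_clf().fit(Z_tr, y_tr)
        avg_scores[fm_name][clf_name] = f1_score(
            y_te, clf.predict(Z_te), average='macro')

for fm_name, scores in avg_scores.items():
    print(fm_name, {k: round(v, 4) for k, v in scores.items()})
```

Transforming the features in the outer loop is the same economy the feature cache provides in this notebook: the expensive step runs once per feature model rather than once per combination.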

In [None]:
rows = {}
for fm_name in all_scores:
    for clf_name in all_scores[fm_name]:
        key = f'{fm_name}.{clf_name}'
        rows[key] = [all_avg_scores[fm_name][clf_name],
                     *all_scores[fm_name][clf_name]]
df = pd.DataFrame(rows)
df.index = ['average', *y_train.columns]
df = df.T.sort_values('average', ascending=False)
df

In [None]:
import matplotlib.pyplot as plt

df.T.drop(['average']).boxplot(
    figsize=(18, 6), rot=90)

plt.show()

Let's save the models for future use.

In [None]:
from fgclassifier.utils import save_model

def clear_cache(model):
    if hasattr(model, 'steps'):
        for (name, step) in model.steps:
            clear_cache(step)
    if hasattr(model, 'cache'):
        model.cache = None
    return model

for name, model in results['models'].items():
    clear_cache(model)
    save_model(model)

## Conclusion

- `ComplementNB` performs much better than plain `MultinomialNB`, because our class labels are mostly imbalanced.
- `LatentDirichletAllocation` topics are not suitable features for our classification problem, as they are often collinear. They often fare no better than the dummy classifier that simply returns the most frequent label.
- LSA (Latent Semantic Analysis, i.e., TF-IDF + SVD) looks much more promising, especially when combined with Linear Discriminant Analysis or SVC.
- Finding the right vocabulary (`min_df` and n-gram range) is crucial. Discarding noise early often outperforms running dimensionality reduction later.
- SVD makes the features (components) more independent of each other, which makes it easier for LDA and SVC to find a good fit.
- Tree-based models are not particularly useful, but the results might differ had we tuned the tree structure more.
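
One practical detail behind the Naive Bayes runs: they were restricted to TF-IDF features because SVD-based features can be negative. The sketch below reproduces this on random count data with scikit-learn; the data and dimensions are arbitrary.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
X_counts = rng.poisson(1.0, size=(40, 12)).astype(float)  # non-negative counts
y = rng.randint(0, 2, size=40)

# Non-negative input is accepted
MultinomialNB().fit(X_counts, y)

# SVD projections mix positive and negative values
X_svd = TruncatedSVD(n_components=3, random_state=0).fit_transform(X_counts)
has_negative = bool((X_svd < 0).any())

try:
    MultinomialNB().fit(X_svd, y)
    nb_failed = False
except ValueError as err:
    # scikit-learn rejects negative feature values for MultinomialNB
    nb_failed = True
    print('MultinomialNB rejected the input:', err)

print(has_negative, nb_failed)
```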

## Next Steps

Required:

- Tune hyperparameters for `ComplementNB`, `TruncatedSVD`, `LinearDiscriminantAnalysis` and `SVC`/`LinearSVC`. Try different kernel functions.
- Try over-/under-sampling since most of our classes are imbalanced. [Possible solution](https://imbalanced-learn.org/)
- Test some boosting methods, especially [xgboost](https://xgboost.readthedocs.io/en/latest/).
- Test word embedding as features.
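
As a starting point for the resampling item, random over-sampling of a minority class can be sketched with `sklearn.utils.resample` alone, before reaching for a dedicated library. The data below is synthetic and has two classes, whereas each label in this project has four.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)  # 90% majority, 10% minority

minority = y == 1
n_majority = int((~minority).sum())

# Draw minority rows with replacement until they match the majority count
X_min_up, y_min_up = resample(
    X[minority], y[minority],
    replace=True, n_samples=n_majority, random_state=0)

X_bal = np.vstack([X[~minority], X_min_up])
y_bal = np.concatenate([y[~minority], y_min_up])
print(np.bincount(y_bal))
```

Resampling should be applied to the training split only, so that validation scores still reflect the original class distribution.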

Optional:

- Possibly use a different classifier for each label.
- Test two-step prediction: first run a binary prediction for "mentioned" vs "not mentioned", i.e., -2 vs (-1, 0, 1), then predict (-1, 0, 1) for the mentioned samples only.
    - This could happen as either [ClassifierChain](https://scikit-learn.org/stable/modules/multiclass.html#classifierchain) or separate steps.
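
A minimal sketch of the two-step idea as separate steps, for a single label on synthetic data. The features, the rule that generates the labels, and the choice of `LogisticRegression` for both stages are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(400, 5))

# Synthetic label: feature 0 decides "mentioned", feature 1 the sentiment
mentioned_true = X[:, 0] > 0
sentiment = np.digitize(X[:, 1], [-0.5, 0.5]) - 1  # values in {-1, 0, 1}
y = np.where(mentioned_true, sentiment, -2)

# Stage 1: mentioned (-1, 0, 1) vs not mentioned (-2)
mentioned = y != -2
stage1 = LogisticRegression(max_iter=1000).fit(X, mentioned)

# Stage 2: sentiment, trained on the mentioned samples only
stage2 = LogisticRegression(max_iter=1000).fit(X[mentioned], y[mentioned])


def predict_two_step(X_new):
    """Default to -2, then fill in sentiment where stage 1 says 'mentioned'."""
    pred = np.full(len(X_new), -2)
    is_mentioned = stage1.predict(X_new).astype(bool)
    if is_mentioned.any():
        pred[is_mentioned] = stage2.predict(X_new[is_mentioned])
    return pred


pred = predict_two_step(X)
print('training accuracy:', (pred == y).mean())
```

The second stage never sees "not mentioned" samples, so it can focus on separating sentiment without being dominated by the -2 class.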

In [25]:
model = results['models']['lsa_500_en_LDA']
print(X_test[0:1].shape)
probas = model.predict_proba(X_test[0:1])
probas[0].shape

2018-12-03 11:52:24,510 [INFO]   lsa_500_en: transform use cache.


(1,)


(2000, 4)

In [20]:
model.predict(X_test[0:1])

2018-12-03 11:28:47,421 [INFO]   lsa_500_en: transform use cache.


array([[-2, -2, -2, ..., -2,  1, -2],
       [-2, -2, -2, ..., -2,  1, -2],
       [-2, -2, -2, ..., -2,  1,  1],
       ...,
       [-2, -2, -1, ...,  1, -1, -1],
       [-2, -2, -2, ..., -2,  0, -2],
       [-2, -2, -2, ..., -2,  1, -2]])