## Part 1: Feature Engineering

This notebook showcases different techniques for feature engineering and how they perform with the baseline model.

In [6]:
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import logging


logging.getLogger('fgclassifier').setLevel(logging.INFO)

## 1. Data Format and Word Segmentation

First, check how word segmentation works with the library we chose, jieba:

In [7]:
import jieba
from config import validate_data_path, train_data_path
from fgclassifier import read_csv

df = read_csv(validate_data_path, seg_words=False)

print(df['content'][0])
segs = jieba.lcut(df['content'][0])
print(" ".join(segs))

2018-11-11 14:56:23,403 [INFO] Reading data/validate/sentiment_analysis_validationset.csv..


"哎，想当年来佘山的时候，啥都没有，三品香算镇上最大看起来最像样的饭店了。菜品多，有点太多，感觉啥都有，杂都不足以形容。随便点些，居然口味什么的都好还可以，价钱自然是便宜当震惊。元宝虾和椒盐九肚鱼都不错吃。不过近来几次么，味道明显没以前好了。冷餐里面一个凉拌海带丝还可以，酸酸甜甜的。镇上也有了些别的大点的饭店，所以不是每次必来了。对了，这家的生意一如既往的超级好，不定位基本吃不到。不过佘山这边的人吃晚饭很早的，所以稍微晚点去就很空了。"
" 哎 ， 想当年 来 佘山 的 时候 ， 啥 都 没有 ， 三品 香算 镇上 最大 看起来 最 像样 的 饭店 了 。 菜品 多 ， 有点 太 多 ， 感觉 啥 都 有 ， 杂都 不足以 形容 。 随便 点些 ， 居然 口味 什么 的 都 好 还 可以 ， 价钱 自然 是 便宜 当 震惊 。 元宝 虾 和 椒盐 九肚鱼 都 不错 吃 。 不过 近来 几次 么 ， 味道 明显 没 以前 好 了 。 冷餐 里面 一个 凉拌 海带丝 还 可以 ， 酸酸甜甜 的 。 镇上 也 有 了 些 别的 大点 的 饭店 ， 所以 不是 每次 必来 了 。 对 了 ， 这家 的 生意 一如既往 的 超级 好 ， 不 定位 基本 吃 不到 。 不过 佘山 这边 的 人 吃晚饭 很早 的 ， 所以 稍微 晚点 去 就 很 空 了 。 "


In [8]:
# Replace each blank space with a placeholder token so that it survives segmentation
jieba.add_word('BBLANKK')
jieba.lcut("我，来到北京  清华大学".replace(' ', 'BBLANKK'))

['我', '，', '来到', '北京', 'BBLANKK', 'BBLANKK', '清华大学']
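
The placeholder trick above relies on registering `BBLANKK` in jieba's dictionary. As a rough alternative sketch that needs no dictionary change, the text can be split on spaces first and each chunk segmented separately. The helper name `cut_keep_blanks` and its `cut` parameter are hypothetical; `cut` stands in for any segmenter such as `jieba.lcut`, and the demo uses `list` (character-level splitting) only so that it runs without jieba.

```python
def cut_keep_blanks(text, cut, blank_token='BBLANKK'):
    """Segment `text` with `cut`, emitting `blank_token` for every space."""
    tokens = []
    for k, chunk in enumerate(text.split(' ')):
        if k > 0:
            # Each split boundary corresponds to exactly one space.
            tokens.append(blank_token)
        if chunk:
            tokens.extend(cut(chunk))
    return tokens


# Demo with a stand-in segmenter; pass jieba.lcut to get real word segments.
demo_tokens = cut_keep_blanks("ab  cd", cut=list)
print(demo_tokens)
```

Two consecutive spaces yield two placeholder tokens, which matches the behaviour of the cell above.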

In [9]:
df.iloc[0,:]

id                                                                                          0
content                                     "哎，想当年来佘山的时候，啥都没有，三品香算镇上最大看起来最像样的饭店了。菜品多，有点太多，...
location_traffic_convenience                                                               -2
location_distance_from_business_district                                                   -2
location_easy_to_find                                                                      -2
service_wait_time                                                                           0
service_waiters_attitude                                                                   -2
service_parking_convenience                                                                -2
service_serving_speed                                                                      -2
price_level                                                                                 1
price_cost_effective                                        

## 2. Basic Statistics

First, check how many records we have. As word segmentation takes a while, we read the raw data first.

In [10]:
from config import validate_data_path, train_data_path, testa_data_path

# Without segmentation, this is faster
df_train = read_csv(train_data_path, seg_words=False, sample_n=None)
df_validate = read_csv(validate_data_path, seg_words=False, sample_n=None)
print("Training data:", df_train.shape)
print("Validation data:", df_validate.shape)

2018-11-11 14:56:23,766 [INFO] Reading data/train/sentiment_analysis_trainingset.csv..
2018-11-11 14:56:25,451 [INFO] Reading data/validate/sentiment_analysis_validationset.csv..


Training data: (105000, 22)
Validation data: (15000, 22)


In [11]:
df_testa = read_csv(testa_data_path, seg_words=False, sample_n=None)
print("Test-A data:", df_testa.shape)

2018-11-11 14:56:25,754 [INFO] Reading data/test-a/sentiment_analysis_testa.csv..


Test-A data: (15000, 22)


Next, check what the data looks like after segmentation.

In [None]:
df_train_full = read_csv(train_data_path, seg_words=True, sample_n=None)
df_validate_full = read_csv(validate_data_path, seg_words=True, sample_n=None)

In [None]:
from collections import Counter

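def count_words(content, counter=None):
    """Count tokens in pre-segmented, space-separated texts."""
    # `counter or Counter()` would discard an empty Counter passed by the caller
    counts = counter if counter is not None else Counter()
    sentences = []
    for s in content:
        ss = s.split(' ')
        sentences.append(ss)
        counts.update(ss)
    return counts, sentences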

count_train, sentences = count_words(df_train_full['content'])
print(count_train.most_common()[:10])
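
The optional `counter` argument lets counts accumulate across several datasets. A small sketch with made-up, already-segmented strings (not taken from the dataset), using a variant of `count_words` that checks `counter is not None` so that an empty `Counter` passed by the caller is still updated in place:

```python
from collections import Counter


def count_words(content, counter=None):
    counts = counter if counter is not None else Counter()
    sentences = []
    for s in content:
        words = s.split(' ')
        sentences.append(words)
        counts.update(words)
    return counts, sentences


# Made-up, pre-segmented toy reviews (space-separated tokens).
toy_train = ["味道 很 好", "服务 很 好"]
toy_validate = ["味道 一般"]

shared = Counter()
count_words(toy_train, counter=shared)
count_words(toy_validate, counter=shared)
print(shared.most_common(3))
```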

In [None]:
from wordcloud import WordCloud

font = './misc/SourceHanSansHWSC/SourceHanSansHWSC-Regular.otf'
wordcloud = WordCloud(
    font_path=font, width=1200, height=800,
    background_color='rgb(55, 71, 79)',
).generate_from_frequencies(dict(count_train.most_common()[60:5000]))

plt.figure(figsize=(12, 8))

# Display the generated image with matplotlib
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
print(np.mean([len(s) for s in sentences]))

In [None]:
print('Vocabulary size: {}'.format(len(count_train)))

In [None]:
import pandas as pd

df = pd.DataFrame({ 'values': list(count_train.values()) })
df.describe()

In [None]:
df_train.drop(['id', 'content'], axis=1).hist(figsize=(14, 14))
plt.show()

## Overall Distribution

In [None]:
x = df_train.drop(['id', 'content'], axis=1).values.ravel()
plt.hist(x)
plt.xticks([-2, -1, 0, 1])
plt.show()
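
The histogram gives the shape of the pooled label distribution; the same information can be read off as proportions. A minimal sketch on a made-up label matrix (the rows below are illustrative, not real data), using only the standard library:

```python
from collections import Counter

# Made-up label rows: one row per review, one column per aspect.
toy_labels = [
    [-2, -2, 1, 0],
    [-2, 1, 1, -1],
    [-2, -2, -2, 1],
]

# Flatten all aspect columns into one list, like `.values.ravel()` does.
flat = [label for row in toy_labels for label in row]
label_counts = Counter(flat)
label_share = {k: label_counts[k] / len(flat) for k in sorted(label_counts)}
for label, share in label_share.items():
    print(f"{label:>2}: {share:.1%}")
```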

In [None]:
import config
from fgclassifier.baseline import Indie

model = Indie()
X_train, Y_train = model.load(config.train_data_path, sample_n=1000)

## 3. Make English Translation

To help non-Chinese speakers understand the content, we create a subset of the training data with English translations.

If you want to run the following code yourself, follow the instructions [here](https://cloud.google.com/translate/docs/quickstart-client-libraries#client-libraries-install-python).

In [1]:
from config import validate_data_path, train_data_path, testa_data_path
from fgclassifier import read_csv
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

df_train = read_csv(train_data_path, seg_words=False, sample_n=None)

Building prefix dict from the default dictionary ...
2018-11-11 20:04:35,454 [DEBUG] Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/6r/772b4sy16rg94jhq1fskv9fc0000gn/T/jieba.cache
2018-11-11 20:04:35,457 [DEBUG] Loading model from cache /var/folders/6r/772b4sy16rg94jhq1fskv9fc0000gn/T/jieba.cache
Loading model cost 0.828 seconds.
2018-11-11 20:04:36,283 [DEBUG] Loading model cost 0.828 seconds.
Prefix dict has been built succesfully.
2018-11-11 20:04:36,285 [DEBUG] Prefix dict has been built succesfully.
2018-11-11 20:04:36,287 [INFO] Reading data/train/sentiment_analysis_trainingset.csv..


In [2]:
try:
    # Reuse translations cached by an earlier run, if any
    translations = joblib.load('data/train/en.pkl')
except FileNotFoundError:
    translations = {}

In [3]:
import glob
import time
from IPython.display import clear_output

from tqdm import tqdm
from google.cloud import translate
from joblib import Parallel, delayed

# All available credentials
credentials = glob.glob('./misc/google-cloud/*.json')

# Use multiple credentials to bypass rate limit
clients = []
for credential in credentials:
    print(credential)
    clients.append(translate.Client.from_service_account_json(credential))

df = df_train.copy().iloc[0:10000,:]
contents = [x.strip('"') for x in df['content']]
n_client = len(clients)
n_records = df.shape[0]

client_ok = [True for _ in clients]


def get_client(i):
    c = 0
    while not client_ok[i % n_client] and c < n_client:
        c += 1
        i += 1
    i = i % n_client
    client = clients[i] if c < n_client else None
    return i, client

failed = []

clear_output()
pbar = tqdm(total=n_records)
queue = list(range(n_records))
n_failed = 0

while len(queue) and n_failed < n_client:
    i = queue.pop(0)
    if i not in translations:
        start_time = time.time()
        client_idx, client = get_client(i)
        if not client:
            raise RuntimeError('No Available Client.')
        try:
            translation = client.translate(contents[i],
                target_language='en', source_language='zh')
            translations[i] = translation['translatedText']
        except Exception as e:
            # print(client_idx + 1, e)
            client_ok[client_idx] = False
            queue.append(i)
            n_failed += 1
            continue
        end_time = time.time()
        # If the request finished in under 0.5 seconds, wait out the remainder
        if end_time < start_time + 0.5:
            time.sleep(start_time + 0.5 - end_time)
    pbar.update(1)

100%|██████████| 10000/10000 [05:05<00:00,  1.24s/it] 
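
The rotation logic in `get_client` can be exercised in isolation. The sketch below is a simplified stand-in, not the notebook's exact function: `pick_client` is a hypothetical name, and it takes the health flags as an argument instead of reading the global `client_ok`.

```python
def pick_client(i, client_ok):
    """Return the index of the first healthy client at or after slot i % n, or None."""
    n = len(client_ok)
    for offset in range(n):
        idx = (i + offset) % n
        if client_ok[idx]:
            return idx
    return None


# Three hypothetical clients; the first has hit its quota.
flags = [False, True, True]
assignments = [pick_client(i, flags) for i in range(6)]
print(assignments)
```

Requests that would have gone to a failed client spill over to the next healthy one, so the load becomes uneven once a client is marked as failed.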

In [4]:
joblib.dump(translations, 'data/train/en.pkl')

['en_train.pkl']

In [20]:
# Replace content with the translations and unescape HTML-encoded apostrophes
df['content'] = [x.replace('&#39;', "'") for x in pd.Series(translations).sort_index()]
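
The replacement above handles only `&#39;`. If the translations contain other HTML entities (an assumption; only the apostrophe case appears in this notebook), the standard library's `html.unescape` decodes them all in one pass. The sample strings are made up:

```python
import html

# Made-up examples of HTML-escaped translation output.
escaped = [
    "It&#39;s good",
    "Fish &amp; chips",
    "The &quot;best&quot; in town",
]
unescaped = [html.unescape(s) for s in escaped]
print(unescaped)
```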

In [21]:
# Sanity Check
print(df_train['content'][9999])
print(df['content'][9999])

"对于很多新店，确实要多个心眼。因为确实存在部分商家会制造虚假点评诱导消费者（譬如前不久的鱼当家），而且出于各种考虑，大众官方对于店家刷点评的事件经常是睁一只眼闭一只眼。所以我经常给小伙伴说，如果一家店（尤其是新店），有三条千万注意：1、仅有的几条到十几条评价都是全五星。2、这些点评在几天内密集出现。3、点评号的级别大都低于三颗星，且有大量系统自动分配账号，无VIP账号点评。那么这家店很可能是在刷点评。而这样的店家通常都有一个共同的特点，就是比较容易着急（用词很委婉了哦：）。须知，重视顾客点评本身是好事，但是饮食业要想做好还是那三条：口味好、服务好、性价比高。这三条做好了，那么酒香不怕巷子深，生意自然会好，体验较好的会员大都会本着良心点评。而如果脱离这个，只想靠刷点评，就好比雇人发传单，知道的人固然多了，但是最重要的三条跟不上，那么只能是一锤子买卖——可能有人会说，南京这么大，我一锤子买卖也能赚好多了呢——须知，一个一锤子买卖积累起来，负面评价会很快冲掉刷的评价，最终的结果不言而喻。——当然，按照店家的逻辑，我在体验前似乎也不应该写点评。不过请放心，大叔已经购买了团购券，会在一周内探店的，到时候自然会根据体验情况如实地更改评价。"
For many new stores, it really takes a lot of attention. Because there are some merchants who will create false reviews to induce consumers (such as the fisherman's head of the past), and for various reasons, the public official's comments on the store's comments are often one eye closed. Therefore, I often tell my friends that if a store (especially a new store) has three items to pay attention to: 1. Only a few to a dozen ratings are all five stars. 2. These reviews app

In [23]:
df.head(1).T

                                          0
id                                        0
content                                   Hey, the lollipop of the dead man, the overlor...
location_traffic_convenience              -2
location_distance_from_business_district  -2
location_easy_to_find                     -2
service_wait_time                         -2
service_waiters_attitude                  1
service_parking_convenience               -2
service_serving_speed                     -2
price_level                               -2


In [26]:
# Save
import csv

# Save the sample that was translated into English with Google Translate
df.to_csv('data/english.csv', index=False, quoting=csv.QUOTE_NONNUMERIC)
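
`csv.QUOTE_NONNUMERIC` quotes every non-numeric field, so commas and apostrophes inside the translated text cannot be mistaken for delimiters. A small standard-library sketch with a made-up row shows the effect:

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC)
writer.writerow(["id", "content", "price_level"])
# Made-up row: the text contains a comma and an apostrophe.
writer.writerow([0, "Cheap, and it's tasty", -2])
csv_text = buffer.getvalue()
print(csv_text)

# Reading back with the same quoting mode turns unquoted fields into floats.
rows = list(csv.reader(io.StringIO(csv_text), quoting=csv.QUOTE_NONNUMERIC))
print(rows[1])
```

Note that the round trip returns the numeric fields as floats, so integer labels may need to be cast back after reading with the `csv` module.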