# 0. Importing Basic Libraries, Definitions & Constants

In [1]:
import numpy as np
import pandas as pd

In [32]:
import csv
import re
import jieba.analyse
import random
from bs4 import BeautifulSoup
from collections import defaultdict

In [33]:
random_seed = 64  # any value works, but fixing one keeps the training/holdout split constant
relevance_cutoff = 1.8  # minimum TF-IDF-style ratio for a token to be included in the relevant token set
occurrent_cutoff = 50  # minimum number of occurrences in the whole document for a token to be included
test_share = 0.2  # share of all fragments that is set aside as the holdout sample


# 1. Preprocessing
## 1.1 Loading Training Data (For Input Vectors)

In [3]:
training_df = pd.read_csv('offsite-test-material/offsite-tagging-training-set.csv', encoding='utf8')
training_df.index = training_df['id']
training_df.head()

id (index),id,tags,text
3443,3443,足球,利物浦重賽擊敗乙組仔　英足盃過關 英格蘭足總盃第三圈今晨重賽，貴為英超勁旅的利物浦上場被乙組...
76056,76056,足球,【中超】恒大「暴力戰」絕殺國安　楊智反重力插水惹爭議（有片） 中超首輪賽事重頭戲，廣州恒大主...
93405,93405,足球,【歐霸決賽】阿積士控球率起腳佔優　隊長卡拉臣輸波不服氣 阿積士以歐洲主要決賽最年輕、平均22...
26767,26767,足球,【歐國盃】韋莫斯澄清更衣室未內訌　盼以團結力量挫愛爾蘭 今晚3場直播\r\r\nE組｜比利時...
20843,20843,梁振英,王維基參選　點解？ 王維基在宣布有意出選的記者會上，打出ABC，Anyone But CY的...


In [4]:
def remove_html(text):
    soup = BeautifulSoup(text, 'html5lib')  # parse the fragment so that HTML markup can be stripped
    text_only = soup.get_text()
    text_normal_newline = re.sub(r"\n\n+", "\n", text_only)  # collapse runs of newlines into one
    text_normal_space = re.sub(r"\s\s+", " ", text_normal_newline)  # collapse other whitespace runs into one space
    return text_normal_space
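
The two substitutions in `remove_html` interact in a way that is easy to miss: a single newline survives, but a newline followed by spaces is folded into one space. A minimal sketch of just the whitespace step, using only the standard library (the helper name `normalize_whitespace` is hypothetical and not part of the notebook):

```python
import re

def normalize_whitespace(text):
    # Same two substitutions as in remove_html, without the HTML parsing step.
    collapsed_newlines = re.sub(r"\n\n+", "\n", text)
    return re.sub(r"\s\s+", " ", collapsed_newlines)

print(repr(normalize_whitespace("a\n\n\nb")))  # 'a\nb'
print(repr(normalize_whitespace("a   b")))     # 'a b'
print(repr(normalize_whitespace("a\n  b")))    # 'a b'
```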

In [5]:
training_df['text_clean'] = training_df['text'].apply(remove_html)
training_df['clean_length'] = training_df['text_clean'].str.len()
training_df.head()

id (index),id,tags,text,text_clean,clean_length
3443,3443,足球,利物浦重賽擊敗乙組仔　英足盃過關 英格蘭足總盃第三圈今晨重賽，貴為英超勁旅的利物浦上場被乙組...,利物浦重賽擊敗乙組仔　英足盃過關 英格蘭足總盃第三圈今晨重賽，貴為英超勁旅的利物浦上場被乙組...,369
76056,76056,足球,【中超】恒大「暴力戰」絕殺國安　楊智反重力插水惹爭議（有片） 中超首輪賽事重頭戲，廣州恒大主...,【中超】恒大「暴力戰」絕殺國安　楊智反重力插水惹爭議（有片） 中超首輪賽事重頭戲，廣州恒大主...,631
93405,93405,足球,【歐霸決賽】阿積士控球率起腳佔優　隊長卡拉臣輸波不服氣 阿積士以歐洲主要決賽最年輕、平均22...,【歐霸決賽】阿積士控球率起腳佔優　隊長卡拉臣輸波不服氣 阿積士以歐洲主要決賽最年輕、平均22...,948
26767,26767,足球,【歐國盃】韋莫斯澄清更衣室未內訌　盼以團結力量挫愛爾蘭 今晚3場直播\r\r\nE組｜比利時...,【歐國盃】韋莫斯澄清更衣室未內訌　盼以團結力量挫愛爾蘭 今晚3場直播\nE組｜比利時Vs愛爾...,770
20843,20843,梁振英,王維基參選　點解？ 王維基在宣布有意出選的記者會上，打出ABC，Anyone But CY的...,王維基參選　點解？ 王維基在宣布有意出選的記者會上，打出ABC，Anyone But CY的...,1239


In [6]:
def process_group(row):
    return pd.Series(dict(char_cnt=row['clean_length'].sum(), record_cnt=row.clean_length.count()))
labels_df = pd.DataFrame(training_df.groupby(['tags']).apply(process_group))
labels_df['label_id'] = pd.Categorical(labels_df.index).codes
label_dict = {a: b.label_id for a, b in labels_df.iterrows()}
label_id_dict = {b.label_id: a for a, b in labels_df.iterrows()}
labels = list(label_dict.keys())
labels_df.head()


tags (index),char_cnt,record_cnt,label_id
梁振英,868598,929,0
美國大選,972470,842,1
足球,1672172,2123,2


## 1.2 Loading Data (For Frequency Analysis)

In [7]:
text_dict = defaultdict(list)
with open('offsite-test-material/offsite-tagging-training-set.csv', 'r', encoding='utf8') as f:
    file_reader = csv.reader(f, delimiter=',', quotechar='"')
    next(file_reader)
    for row in file_reader:
        text_dict[row[1]].append(remove_html(row[2]))

fulltext_dict = {k: '\n'.join(v) for k, v in text_dict.items()}

In [8]:
print('Found the following categories:\n{}'.format('\n'.join(['{}: {} fragments with {} characters'
                                                               .format(k, len(text_dict[k]), len(fulltext_dict[k]))
                                                               for k in text_dict.keys() ])))

Found the following categories:
足球: 2123 fragments with 1674294 characters
梁振英: 929 fragments with 869526 characters
美國大選: 842 fragments with 973311 characters


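This means that we have more than twice as many articles about soccer (足球) as about either the outgoing Chief Executive (梁振英) or the US election (美國大選). This imbalance is a bit tricky for the maximum TF-IDF a token can reach: the larger a category's share of the overall text, the lower the ceiling for tokens that occur only in that category.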
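
To make the effect of the imbalance concrete: a token that occurs n times, all in one category of length L_c, inside a corpus of length L has a ratio of (n/L_c)/(n/L) = L/L_c. The sketch below evaluates that ceiling per category. It uses the character counts printed above as a stand-in for token counts, which is an assumption; the notebook itself measures lengths in tokens. The English dictionary keys are mine.

```python
# Character counts per category, copied from the output above.
# Keys: soccer = 足球, cy_leung = 梁振英, us_election = 美國大選.
char_counts = {"soccer": 1674294, "cy_leung": 869526, "us_election": 973311}
total_chars = sum(char_counts.values())

# Ceiling of the ratio for a token that occurs in one category only: L / L_c.
max_ratio = {label: total_chars / count for label, count in char_counts.items()}

for label, ratio in sorted(max_ratio.items(), key=lambda item: item[1]):
    print(f"{label}: {ratio:.2f}")
```

Under this approximation the ceiling is roughly 2.1 for soccer, 3.6 for the US election and 4.0 for the Chief Executive, so a single cutoff is harder to pass for soccer tokens than for tokens of the smaller categories.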

## 1.2 Selecting most relevant tokens
I am building a TF-IDF-esque model, for which I select the most 'relevant' tokens as features. Relevance is defined here as the highest ratio of a token's frequency within a 'term' to its frequency in the overall 'document'. A 'term' here is the union of all segments that belong to a single category. The 'document' is the union of all segments.
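
The relevance ratio described above can be sketched on toy data as follows. This is only an illustration under simplifying assumptions: the two categories, their whitespace-separated English tokens and the function name `relevance` are all hypothetical, and no Chinese segmentation is involved.

```python
from collections import Counter

# Toy token streams per category (hypothetical data).
category_tokens = {
    "soccer": "goal match goal the team".split(),
    "politics": "vote the election the".split(),
}
category_counts = {label: Counter(tokens) for label, tokens in category_tokens.items()}
corpus_counts = sum(category_counts.values(), Counter())
corpus_length = sum(corpus_counts.values())

def relevance(token):
    """Highest within-category frequency divided by the corpus-wide frequency."""
    best_tf = max(counts[token] / sum(counts.values())
                  for counts in category_counts.values())
    return best_tf / (corpus_counts[token] / corpus_length)

print(relevance("goal"))  # about 1.8: occurs only in the soccer category
print(relevance("the"))   # about 1.5: occurs in both categories
```

A token that is exclusive to one category scores higher than one that is spread over several, which is the property the feature selection relies on.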

In [9]:
#dictionary for occurrence of short tokens in each classified doc
labeldicts_short = {_: (defaultdict(float), 0) for _ in labels} 
#dictionary for occurrence of long tokens in each classified doc
labeldicts_long = {_: (defaultdict(float), 0) for _ in labels} 
# dictionary for occurrence of short tokens in the whole document
docdict_short = defaultdict(float), 0 
# dictionary for occurrence of long tokens in the whole document
docdict_long = defaultdict(float), 0


In [10]:
# Counting token frequency
training_clean = list()
for label, combined_text in fulltext_dict.items():
    # cut_all=True is jieba's full mode: it emits every dictionary word it finds, overlaps included
    short_tokens = jieba.cut(combined_text, cut_all=True)
    for token in short_tokens:
        labeldicts_short[label][0][token] += 1
        docdict_short[0][token] += 1

    # cut_all=False is jieba's default (accurate) mode: a single, non-overlapping segmentation
    long_tokens = jieba.cut(combined_text, cut_all=False)
    for token in long_tokens:
        labeldicts_long[label][0][token] += 1
        docdict_long[0][token] += 1
        
        


Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/9d/qql6_x6575d88_7f44mgptw40000gp/T/jieba.cache
Loading model cost 0.820 seconds.
Prefix dict has been built succesfully.


In [11]:
# Calculating term/document length
for label in labels:
    labeldicts_long[label] = labeldicts_long[label][0], sum(labeldicts_long[label][0].values())
    labeldicts_short[label] = labeldicts_short[label][0], sum(labeldicts_short[label][0].values())
    
docdict_long = docdict_long[0], sum(docdict_long[0].values())
docdict_short = docdict_short[0], sum(docdict_short[0].values())


In [12]:
# Helper function that returns the highest TF-IDF-style ratio of a token across all classes,
# together with the occurrence count in that class and the total occurrence count.
# Highly relevant tokens have maximum ratios of about 2-3: they occur exclusively in fragments
# of one class, and the exact value depends on the term length of that class.
# Irrelevant tokens have a uniform ratio of about 1 (they occur everywhere with the same frequency).
def relative_frequency(token, classdicts, docdict, docdict_total=None):
    occurrences = [(classdict[0][token], classdict[1]) for classdict in classdicts if token in classdict[0]]
    if occurrences:
        max_occurence, term_length = max(occurrences, key=lambda _: _[0]/_[1])
        total_occurrence, doc_length  = docdict[0][token], docdict[1]
        tf = (max_occurence/term_length)
        df = (total_occurrence/doc_length)
        return (tf/df, max_occurence, total_occurrence)
    else:
        print(token)
        return 0, 0, docdict[0].get(token, 0)

relative_frequency('重賽', labeldicts_long.values(), docdict_long)

(2.0907319330176026, 29.0, 29.0)

In [13]:
short_classdicts = labeldicts_short.values()
long_classdicts = labeldicts_long.values()
maxfreq_short = {key: relative_frequency(key, short_classdicts, docdict_short) for key in docdict_short[0].keys()}
maxfreq_long = {key: relative_frequency(key, long_classdicts, docdict_long) for key in docdict_long[0].keys()}

In [14]:
relevant_tokens_short_list = sorted([key for key, value in maxfreq_short.items() 
                                     if (value[0] > relevance_cutoff and
                                         value[2] > occurrent_cutoff and
                                         key.isalpha())])

relevant_tokens_long_list = sorted([key for key, value in maxfreq_long.items() 
                                    if (value[0] > relevance_cutoff and
                                        value[2] > occurrent_cutoff and
                                        key.isalpha())])

## 1.3 Creating Training Data

In [15]:
def sentence_to_vector(sentence, tokenlist, cut_all=False):
    a = defaultdict(int)
    tokens = jieba.cut(sentence, cut_all=cut_all)
    for token in tokens:
        a[token] += 1
    out_dict = {_: a.get(_, 0) for _ in tokenlist}
    return pd.Series(out_dict)

occ_input_long = pd.DataFrame(training_df.text_clean.apply(
    lambda _: sentence_to_vector(_, relevant_tokens_long_list)))

occ_input_short = pd.DataFrame(training_df.text_clean.apply(
    lambda _: sentence_to_vector(_, relevant_tokens_short_list, cut_all=True)))

data_target = pd.DataFrame(training_df.merge(labels_df, how='inner', left_on='tags', right_index=True)['label_id'])    
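
What `sentence_to_vector` produces can be illustrated without jieba or pandas. The sketch below is a simplified stand-in, not the notebook's implementation: it takes an already tokenized text, and the vocabulary, the sample sentence and the name `count_vector` are hypothetical.

```python
from collections import Counter

def count_vector(tokens, vocabulary):
    """Count how often each vocabulary token occurs in `tokens`, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]

vocabulary = ["election", "goal", "match"]            # hypothetical relevant tokens, sorted
tokens = "goal after goal decided the match".split()  # stand-in for the output of jieba.cut
print(count_vector(tokens, vocabulary))  # [0, 2, 1]
```

Tokens outside the vocabulary are dropped, so every text is mapped to a vector of the same length regardless of how long it is.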

In [16]:
indices = list(occ_input_long.index)
random.seed(random_seed)
test_indices = random.sample(indices, int(len(indices)*test_share))
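test_index_set = set(test_indices)  # set membership keeps this filter linear instead of quadratic
train_indices = [_ for _ in indices if _ not in test_index_set]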
training_data_long = occ_input_long.loc[train_indices]
training_data_short = occ_input_short.loc[train_indices]
holdout_data_long = occ_input_long.loc[test_indices]
holdout_data_short = occ_input_short.loc[test_indices]
training_target = data_target.loc[train_indices]
holdout_target = data_target.loc[test_indices]


# 2. Training Models
## 2.0 Imports & Definitions

In [26]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import SVC


In [18]:
def explain_misclassification(id_, prediction, holdout):
    relevant_text = training_df.loc[id_]['text_clean']
    provided_label = training_df.loc[id_]['tags']
    predicted_label = label_id_dict[prediction[holdout.index.get_loc(id_)]]
    print('The following text was classified as {0}, but labelled as {1}:\n{2}'
          .format(predicted_label, provided_label, relevant_text))

## 2.1 Standard RandomForest Classifier

In [19]:
RFC_short = RandomForestClassifier()
RFC_short.fit(np.asarray(training_data_short), np.asarray(training_target).ravel())
rfc_prediction_short = RFC_short.predict(np.asarray(holdout_data_short))
misclassified_ids_short = list(sorted(holdout_target[rfc_prediction_short!=holdout_target['label_id']].index))
print('A RandomForest Classifier reached an accuracy score of {0:.4f} for short tokens.\nThis means that a total of {1}'
      ' fragments (out of {2} fragments in the holdout sample) was misclassified.\nThe misclassified ids are:\n{3}'
      .format(accuracy_score(rfc_prediction_short, holdout_target),
              len(misclassified_ids_short),
              len(holdout_target),
             ', '.join(str(_) for _ in misclassified_ids_short)))

A RandomForest Classifier reached an accuracy score of 0.9923 for short tokens.
This means that a total of 6 fragments (out of 778 fragments in the holdout sample) was misclassified.
The misclassified ids are:
14227, 14792, 47772, 51805, 58992, 80645
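
As a quick consistency check on the printout above, the accuracy score and the number of misclassified fragments should determine each other. A minimal sketch (the helper name `accuracy_from_errors` is hypothetical):

```python
def accuracy_from_errors(n_errors, n_samples):
    """Accuracy implied by a number of misclassified samples."""
    return 1 - n_errors / n_samples

# 6 misclassified fragments out of 778 in the holdout sample
print(f"{accuracy_from_errors(6, 778):.4f}")  # 0.9923
```

The same relation can be used to check the other printouts: the error count should equal the number of listed ids.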


In [20]:
RFC_long = RandomForestClassifier()
RFC_long.fit(np.asarray(training_data_long), np.asarray(training_target).ravel())
rfc_prediction_long = RFC_long.predict(np.asarray(holdout_data_long))
misclassified_ids_long = list(holdout_target[rfc_prediction_long!=holdout_target['label_id']].index)
print('A RandomForest Classifier reached an accuracy score of {0:.4f} for long tokens.\nThis means that a total of {1}'
      ' fragments (out of {2} fragments in the holdout sample) was misclassified.\nThe misclassified ids are:\n{3}'
      .format(accuracy_score(rfc_prediction_long, holdout_target),
              len(misclassified_ids_long),
              len(holdout_target),
             ', '.join(str(_) for _ in misclassified_ids_long)))

A RandomForest Classifier reached an accuracy score of 0.9910 for long tokens.
This means that a total of 7 fragments (out of 778 fragments in the holdout sample) was misclassified.
The misclassified ids are:
54209, 14792, 47772, 47893, 51805, 1160, 61971


In [22]:
misclassified_id = misclassified_ids_short[0]
explain_misclassification(misclassified_id, rfc_prediction_short, holdout_target)

The following text was classified as 梁振英, but labelled as 足球:
傑志中心剔出改劃　區議員指原區重置無可能　仍有機會回收場地 沙田區議會發展及房委會將於本周四（11月3日）討論安睦街資助房屋發展計劃。根據規劃署最新文件，傑志中心地皮已被剔出改劃建議，至於中心現時空置的北面用地，政府則建議由「休憩用地」改劃為「住宅」（約0.43公頃），最大住用總樓面面積不超過約2. 沙田區議會發展及房委會將於本周四（11月3日）討論安睦街資助房屋發展計劃。根據規劃署最新文件，傑志中心地皮已被剔出改劃建議，至於中心現時空置的北面用地，政府則建議由「休憩用地」改劃為「住宅」（約0.43公頃），最大住用總樓面面積不超過約2.6萬平方米，建築物高度限為32層內，預料可提供約560個單位。
不過「魔鬼細節」卻在註腳上，文件細字上寫明現時傑志中心只作短期租用，政府另覓土地重置訓練中心及在落實遷置後，便展開改劃工作，再次呼應特首梁振英早前指傑志中心重置後始收地的言論。
延伸閱讀：政府擬收回石門傑志中心建資助出售房屋 明年約滿可合法收地
在facebook披露政府摸底工作的沙田區議員容溟舟接受《香港01》訪問時表示，傑志要在沙田區重置「近乎無可能」，因附近平地都已正在興建住宅，今次政府剔出改劃範圍只屬短期措施，長遠亦有機會等待該地於2017年9月租約期滿後再回收亦合法。


## 2.2 Gradient Boosted Classifier (Standard SkLearn)

In [34]:
GBC_short = GradientBoostingClassifier()
GBC_short.fit(np.asarray(training_data_short), np.asarray(training_target).ravel())
gbc_prediction_short = GBC_short.predict(np.asarray(holdout_data_short))
misclassified_ids_short = list(sorted(holdout_target[gbc_prediction_short!=holdout_target['label_id']].index))
print('A GradientBoosted tree ensemble Classifier reached an accuracy score of {0:.4f} for short tokens.'
      '\nThis means that a total of {1}'
      ' fragments (out of {2} fragments in the holdout sample) was misclassified.\nThe misclassified ids are:\n{3}'
      .format(accuracy_score(gbc_prediction_short, holdout_target),
              len(misclassified_ids_short),
              len(holdout_target),
             ', '.join(str(_) for _ in misclassified_ids_short)))

A GradientBoosted tree ensemble Classifier reached an accuracy score of 0.9884 for short tokens.
This means that a total of 9 fragments (out of 778 fragments in the holdout sample) was misclassified.
The misclassified ids are:
5375, 14792, 23475, 47772, 51805, 53293, 58992, 80645, 88049


In [None]:
misclassified_id = misclassified_ids_short[0]
#explain_misclassification(misclassified_id, gbc_prediction_short, holdout_target)

In [36]:
GBC_long = GradientBoostingClassifier()
GBC_long.fit(np.asarray(training_data_long), np.asarray(training_target).ravel())
gbc_prediction_long = GBC_long.predict(np.asarray(holdout_data_long))
misclassified_ids_long = list(sorted(holdout_target[gbc_prediction_long!=holdout_target['label_id']].index))
print('A GradientBoosted tree ensemble Classifier reached an accuracy score of {0:.4f} for long tokens.'
      '\nThis means that a total of {1}'
      ' fragments (out of {2} fragments in the holdout sample) was misclassified.\nThe misclassified ids are:\n{3}'
      .format(accuracy_score(gbc_prediction_long, holdout_target),
              len(misclassified_ids_long),
              len(holdout_target),
             ', '.join(str(_) for _ in misclassified_ids_long)))

A GradientBoosted tree ensemble Classifier reached an accuracy score of 0.9897 for long tokens.
This means that a total of 8 fragments (out of 778 fragments in the holdout sample) was misclassified.
The misclassified ids are:
14792, 45215, 47772, 47893, 51805, 51851, 54209, 80645
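
Several ids show up in more than one of the lists printed so far. The sketch below intersects the four lists to find the fragments that every tree-based model got wrong. The ids are copied from the outputs above; the dictionary keys are hypothetical names for the four runs.

```python
# Misclassified ids per run, copied from the printed outputs.
misclassified = {
    "rfc_short": {14227, 14792, 47772, 51805, 58992, 80645},
    "rfc_long": {54209, 14792, 47772, 47893, 51805, 1160, 61971},
    "gbc_short": {5375, 14792, 23475, 47772, 51805, 53293, 58992, 80645, 88049},
    "gbc_long": {14792, 45215, 47772, 47893, 51805, 51851, 54209, 80645},
}

# Fragments that all four runs misclassified.
always_wrong = set.intersection(*misclassified.values())
print(sorted(always_wrong))  # [14792, 47772, 51805]
```

Fragments that no model classifies as labelled are good candidates for a manual look, since the provided label itself may be debatable.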


In [37]:
misclassified_id = misclassified_ids_long[0]
explain_misclassification(misclassified_id, gbc_prediction_long, holdout_target)

The following text was classified as 足球, but labelled as 梁振英:
【向奧巴馬學習】要人支持你　首先你要真心支持本地體育 要得到人真心支持，不是單單只看 8 分鐘的比賽便當自己支持本地體育……不知道新上任的體育專員楊德強先生，會否願意抽時間和市民一齊排隊等買飛，了解香港籃壇生態，做一個不離地的高官？邱益忠 拜讀馬嶽教授一篇「當抽足球水抽火水 拜讀馬嶽教授一篇「當抽足球水抽火水」後，筆者第一個想起的人是美國總統奧巴馬。成也危機公關，奧巴馬極擅長在莊嚴與幽默之間取得平衡，其化危為機的能力甚高，如果他退任後決定開班傳授「化解關公災難」，相信不少人會爭相報讀。
忠實球迷身分　營造親民形象
真．熱愛體育的奧巴馬，對足球、棒球、高爾夫、網球、乒乓球、保齡球等運動相當熟悉，他更是芝加哥公牛的瘋狂球迷，每年被問到哪支球隊是總冠軍熱門，總是堅定不移地回答：「公牛」。
自「籃球之神」Michael Jordan 退休後，公牛隊一直陷入漫長的重建期，若非近幾年 Derrick Rose 無法躲過輪迴的傷病，公牛隊早就重回顛峰，想到這點，公牛球迷總掩不住失望。
「原來總統也會有評估錯誤的時候？」
「原來總統和我一樣，不離不棄支持家鄉球隊？」
鐵血球迷的本色，成功把奧巴馬和普通市民的距離拉近不少。 抽水功力深厚　搞氣氛能手
每年贏得美國四大聯賽的冠軍球隊（NBA 籃球、NFL 美式足球、NHL 冰上曲棍球、MLB 棒球），都會獲得美國總統在白宮接見，這絕對是運動員的最高榮譽。連在 NBA 要風得風、一向自信爆棚、每天在床上「被帥醒」的大帝 Lebron James，在白宮發言時也不禁「口窒窒」，甚至露出童真一面，高呼「Mama，I did it!」。
每次在白宮接見 NBA 冠軍球隊，奧巴馬不會阿諛奉承，反而不斷「搵位入」，大讚自己最愛的公牛隊。如數年前湖人到訪白宮時，奧巴馬非常「識做」，先祝賀當時教練 Phil Jackson 贏得第 10 次總冠軍，成就史上第一的戰績；但立刻鬼馬地補充一句：「不過其中 6 個冠軍是在公牛拿下的，記得嗎，Magic Johnson？」讓 Magic Johnson 哭笑不得。
2014 年兩連冠的熱火創下史上第 2 多的 27 連勝，奧巴馬不忘抽水：「27 連勝的紀錄非常了不起，幾乎可和公牛的 72 

## 2.3 Support Vector Machine with Cosine Similarity Kernel

In [38]:
svc_short = SVC(kernel=cosine_similarity)
svc_short.fit(np.asarray(training_data_short), np.asarray(training_target).ravel())
svc_prediction_short = svc_short.predict(np.asarray(holdout_data_short))
misclassified_ids_short = list(sorted(holdout_target[svc_prediction_short!=holdout_target['label_id']].index))
print('A SupportVectorMachine with CosineSimilarity kernel reached an accuracy score of {0:.4f} for short tokens.'
      '\nThis means that a total of {1}'
      ' fragments (out of {2} fragments in the holdout sample) was misclassified.\nThe misclassified ids are:\n{3}'
      .format(accuracy_score(svc_prediction_short, holdout_target),
              len(misclassified_ids_short),
              len(holdout_target),
             ', '.join(str(_) for _ in misclassified_ids_short)))

A SupportVectorMachine with CosineSimilarity kernel reached an accuracy score of 0.9961 for short tokens.
This means that a total of 3 fragments (out of 778 fragments in the holdout sample) was misclassified.
The misclassified ids are:
14792, 47772, 51805
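
The cell above passes `cosine_similarity` to `SVC` as a callable kernel, so the classifier works on a matrix of pairwise cosine similarities between count vectors. The following pure-Python sketch shows what such a matrix contains. It is an illustration rather than scikit-learn's implementation; the function names and the toy vectors are hypothetical, and giving zero vectors a similarity of 0 is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def cosine_kernel(X, Y):
    """Matrix of pairwise similarities: one row per vector in X, one column per vector in Y."""
    return [[cosine(x, y) for y in Y] for x in X]

# Toy count vectors: the second is the first one doubled, the third shares no token with them.
X = [[2, 0, 1], [4, 0, 2], [0, 3, 0]]
gram = cosine_kernel(X, X)
for row in gram:
    print([round(value, 3) for value in row])
```

Because the similarity depends only on the direction of the vectors, a long article and a short one with the same token proportions look identical to the kernel, which suits fragments whose length varies as much as it does here.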


In [39]:
misclassified_id = misclassified_ids_short[0]
explain_misclassification(misclassified_id, svc_prediction_short, holdout_target)

The following text was classified as 足球, but labelled as 梁振英:
【向奧巴馬學習】要人支持你　首先你要真心支持本地體育 要得到人真心支持，不是單單只看 8 分鐘的比賽便當自己支持本地體育……不知道新上任的體育專員楊德強先生，會否願意抽時間和市民一齊排隊等買飛，了解香港籃壇生態，做一個不離地的高官？邱益忠 拜讀馬嶽教授一篇「當抽足球水抽火水 拜讀馬嶽教授一篇「當抽足球水抽火水」後，筆者第一個想起的人是美國總統奧巴馬。成也危機公關，奧巴馬極擅長在莊嚴與幽默之間取得平衡，其化危為機的能力甚高，如果他退任後決定開班傳授「化解關公災難」，相信不少人會爭相報讀。
忠實球迷身分　營造親民形象
真．熱愛體育的奧巴馬，對足球、棒球、高爾夫、網球、乒乓球、保齡球等運動相當熟悉，他更是芝加哥公牛的瘋狂球迷，每年被問到哪支球隊是總冠軍熱門，總是堅定不移地回答：「公牛」。
自「籃球之神」Michael Jordan 退休後，公牛隊一直陷入漫長的重建期，若非近幾年 Derrick Rose 無法躲過輪迴的傷病，公牛隊早就重回顛峰，想到這點，公牛球迷總掩不住失望。
「原來總統也會有評估錯誤的時候？」
「原來總統和我一樣，不離不棄支持家鄉球隊？」
鐵血球迷的本色，成功把奧巴馬和普通市民的距離拉近不少。 抽水功力深厚　搞氣氛能手
每年贏得美國四大聯賽的冠軍球隊（NBA 籃球、NFL 美式足球、NHL 冰上曲棍球、MLB 棒球），都會獲得美國總統在白宮接見，這絕對是運動員的最高榮譽。連在 NBA 要風得風、一向自信爆棚、每天在床上「被帥醒」的大帝 Lebron James，在白宮發言時也不禁「口窒窒」，甚至露出童真一面，高呼「Mama，I did it!」。
每次在白宮接見 NBA 冠軍球隊，奧巴馬不會阿諛奉承，反而不斷「搵位入」，大讚自己最愛的公牛隊。如數年前湖人到訪白宮時，奧巴馬非常「識做」，先祝賀當時教練 Phil Jackson 贏得第 10 次總冠軍，成就史上第一的戰績；但立刻鬼馬地補充一句：「不過其中 6 個冠軍是在公牛拿下的，記得嗎，Magic Johnson？」讓 Magic Johnson 哭笑不得。
2014 年兩連冠的熱火創下史上第 2 多的 27 連勝，奧巴馬不忘抽水：「27 連勝的紀錄非常了不起，幾乎可和公牛的 72 