# Assignment 10

### 1. Review the lecture content

### 2. Answer the following theory questions

#### 1. What is the independence assumption in Naive Bayes?

Ans:
Naive Bayes is so called because the independence assumptions it makes are indeed very naive for a model of natural language. The conditional independence assumption states that features are independent of each other given the class.   
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z:   
$$\forall(i,j,k) P(X=x_{i}|Y=y_{j},Z=z_{k}) = P(X=x_{i}|Z=z_{k}) $$
which is often written as
$$P(X|Y,Z) = P(X|Z)$$
In practice, the independence assumption treats the features as independent of one another even though they may actually be correlated. Modeling the full joint distribution would require far more data and more complex computation, so the independence assumption is adopted to keep the model simple.
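
To make the factorization concrete, here is a minimal sketch of how the assumption turns a joint likelihood into a product of per-word terms (a sum in log space). The class names, word probabilities, and priors are hypothetical values made up for illustration, not estimates from the assignment's data.

```python
import math

# Hypothetical P(word | class) tables and class priors, made up for illustration.
word_given_class = {
    "sports": {"goal": 0.5, "match": 0.4, "market": 0.1},
    "finance": {"goal": 0.1, "match": 0.2, "market": 0.7},
}
class_prior = {"sports": 0.5, "finance": 0.5}


def log_joint(words, label):
    """log P(label) + sum_i log P(word_i | label): the naive factorization."""
    score = math.log(class_prior[label])
    for word in words:
        score += math.log(word_given_class[label][word])
    return score


doc = ["goal", "match", "goal"]
scores = {label: log_joint(doc, label) for label in class_prior}
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

Each word contributes its own term regardless of which other words appear in the document, which is exactly what the conditional independence assumption allows.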

#### 2. What are MAP (maximum a posteriori) and ML (maximum likelihood)?

Ans:  
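Suppose D is the observed data and H is the hypothesis space. Then
$$h_{MAP} = \arg\max_{h \in H} P(h|D)=\arg\max_{h \in H} \frac{P(D|h)P(h)}{P(D)}=\arg\max_{h \in H}P(D|h)P(h)$$
$h_{MAP}$ is the maximum a posteriori (MAP) hypothesis: the most probable hypothesis in H given the data. $P(D)$ can be dropped in the last step because it does not depend on h.  
If $P(h)$ is a constant, i.e. every hypothesis in H is equally likely a priori, then $P(h)$ can also be dropped and the expression simplifies to $$h_{ML} = \arg\max_{h \in H} P(D|h)$$  

$h_{ML}$ is called the maximum likelihood (ML) hypothesis.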
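
A small sketch of how the two estimates can disagree, assuming a hypothetical hypothesis space with two coin biases and a prior that strongly favours the fair coin (all numbers are invented for illustration):

```python
# Hypothetical setup: each key is a candidate P(heads); each value is its prior P(h).
prior = {0.5: 0.95, 0.9: 0.05}
observations = ["H", "H", "H"]


def likelihood(h, data):
    """P(D | h) for independent flips of a coin with P(heads) = h."""
    p = 1.0
    for flip in data:
        p *= h if flip == "H" else 1.0 - h
    return p


# ML ignores the prior; MAP weights the likelihood by P(h).
h_ml = max(prior, key=lambda h: likelihood(h, observations))
h_map = max(prior, key=lambda h: likelihood(h, observations) * prior[h])
print("h_ML =", h_ml, "h_MAP =", h_map)
```

With only three flips, the likelihood alone prefers the biased coin, while the prior keeps the MAP estimate at the fair coin. With a uniform prior the two would coincide.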

#### 3. What is a support vector in SVM?

Ans:  
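In a support vector machine, the few training samples that lie closest to the separating hyperplane and satisfy the condition below are called support vectors.  
Suppose the hyperplane $(w,b)$ classifies the training samples correctly, i.e. for $(x_{i},y_{i}) \in D$, $w^{T}x_{i}+b>0$ if $y_{i}=+1$ and $w^{T}x_{i}+b<0$ if $y_{i}=-1$. After rescaling $w$ and $b$, we can require $$\left\{
\begin{aligned}
w^{T}x_{i}+b \geq +1, y_{i}=+1; \\
w^{T}x_{i}+b \leq -1, y_{i}=-1. 
\end{aligned}
\right.$$
The training samples closest to the hyperplane, for which equality holds in the expression above, are the "support vectors".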
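
The condition can be checked numerically. The sketch below uses a hypothetical four-point dataset and a hand-picked hyperplane `(w, b)` that happens to be the maximum-margin solution for it; nothing here is fitted by an SVM solver.

```python
import numpy as np

# Hypothetical toy data: two positive and two negative samples.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

# Hand-picked hyperplane (assumed, not learned) satisfying the constraints.
w = np.array([0.5, 0.5])
b = -1.0

# y_i * (w^T x_i + b) >= 1 must hold for every sample.
functional_margin = y * (X @ w + b)
assert np.all(functional_margin >= 1 - 1e-9)

# Support vectors are the samples where the constraint holds with equality.
is_support = np.isclose(functional_margin, 1.0)
support_vectors = X[is_support]
margin_width = 2 / np.linalg.norm(w)
print(support_vectors, margin_width)
```

Only `[2, 2]` and `[0, 0]` sit exactly on the boundaries $w^{T}x+b=\pm 1$; the other two samples could be removed without changing the hyperplane.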

#### 4. What is the intuition behind SVM?

Ans:  
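Given a training set, we want to separate samples of different classes. Many hyperplanes may separate the training samples, so we need to pick the best one. Intuitively, it should be the hyperplane that lies right in the middle between the two classes, like the red plane in the figure, because that boundary is the least sensitive to small perturbations of the training samples.
What SVM does is use the existing data to find this hyperplane, which divides the data into positive and negative samples. When new data arrives, the model can predict whether the input is a positive or a negative sample.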
![avatar](./svm-demo.jpg)  
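
One way to make "the hyperplane in the middle" precise, sketched in standard hard-margin notation: the symbols $w$, $b$, $x_{i}$, $y_{i}$ follow the answer to question 3, and $m$ (the number of training samples) is introduced here for illustration. The distance between the two boundaries $w^{T}x+b=\pm 1$ is $\frac{2}{\|w\|}$, so the widest separation is obtained by solving
$$\min_{w,b} \frac{1}{2}\|w\|^{2} \quad \text{s.t.} \quad y_{i}(w^{T}x_{i}+b) \geq 1, \; i=1,\dots,m$$
This is the textbook formulation for linearly separable data, not something derived in this assignment.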

#### 5. Briefly describe what 'random' means in random forest.

Ans:  
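A random forest is a classifier that trains multiple decision trees and combines their predictions; it is an extension of Bagging. The randomness comes from two sources:    
1) Random samples: each tree is trained on a set drawn at random, with replacement, from the original training set.  
2) Random attributes: only a randomly chosen subset of the features is considered during training (in the standard algorithm, a new subset is drawn at each node split).  
As a result, the diversity of the base learners comes from perturbing both the samples and the attributes, and the larger differences between individual learners can improve the generalization performance of the ensemble.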
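
The two sources of randomness can be sketched with NumPy alone. The sizes, the seed, and the square-root rule for the subset size are assumptions chosen for illustration, and the helper names are hypothetical; no tree is actually trained here.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n_samples, n_features, n_trees = 10, 6, 3
max_features = int(np.sqrt(n_features))  # sqrt heuristic, assumed for illustration


def bootstrap_rows(rng, n_samples):
    """Sample randomness: draw n_samples row indices with replacement."""
    return rng.integers(0, n_samples, size=n_samples)


def candidate_features(rng, n_features, max_features):
    """Attribute randomness: draw a feature subset without replacement."""
    return rng.choice(n_features, size=max_features, replace=False)


draws = []
for tree_id in range(n_trees):
    rows = bootstrap_rows(rng, n_samples)
    features = candidate_features(rng, n_features, max_features)
    draws.append((rows, features))
    print(f"tree {tree_id}: rows={sorted(rows.tolist())}, "
          f"split candidates={sorted(features.tolist())}")
```

In a full implementation `candidate_features` would be called again at every node split, so different nodes of the same tree can look at different features.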

#### 6. What criterion does XGBoost use to find the best split point in a tree?

Ans:  
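The best split point is found roughly as follows:  
1) For each node, iterate over every feature;  
2) For each feature, sort the samples by feature value;  
3) Scan linearly to find the best split value for that feature;  
4) Among all features, choose the best split (the feature and value with the largest gain after splitting).   
The gain is defined as:  
$$Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma$$
where $G_L$, $G_R$ and $H_L$, $H_R$ are the sums of the first- and second-order gradients of the loss over the samples in the left and right child, $\lambda$ is the L2 regularization coefficient on the leaf weights, and $\gamma$ is the complexity penalty for adding a leaf.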
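
The sort-and-scan procedure for a single feature can be sketched as follows. This is a simplified illustration of the exact greedy search, not XGBoost's actual implementation: `split_gain` and `best_split` are hypothetical helper names, `gamma` stands for the per-leaf complexity penalty subtracted at the end of the gain, and placing the threshold at the midpoint between neighbouring values is an assumption.

```python
import numpy as np


def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of splitting a node into left/right children."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma


def best_split(feature, grad, hess, lam=1.0, gamma=0.0):
    """Sort one feature, then scan left to right accumulating G_L and H_L."""
    order = np.argsort(feature)
    x, g, h = feature[order], grad[order], hess[order]
    g_total, h_total = g.sum(), h.sum()
    g_left = h_left = 0.0
    best_gain, best_threshold = -np.inf, None
    for i in range(len(x) - 1):
        g_left += g[i]
        h_left += h[i]
        if x[i] == x[i + 1]:
            continue  # cannot split between identical feature values
        gain = split_gain(g_left, h_left, g_total - g_left,
                          h_total - h_left, lam, gamma)
        if gain > best_gain:
            best_gain, best_threshold = gain, (x[i] + x[i + 1]) / 2
    return best_threshold, best_gain


# Squared-error loss with all current predictions at 0: grad = -y, hess = 1.
feature = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
target = np.array([0.0, 0.0, 0.0, 5.0, 5.0, 5.0])
threshold, gain = best_split(feature, -target, np.ones_like(target))
print(threshold, gain)
```

Repeating `best_split` for every feature and keeping the largest gain corresponds to step 4 above.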


### 3. Practical part

##### Problem description: In this part you are going to build a classifier to detect whether a piece of news was published by the Xinhua News Agency (新华社).

#### Hints:

###### 1. Firstly, you have to come up with a way to represent the news. (Vectorize the sentence, you can find different ways to do so online)  

In [1]:
import os
import pickle
from functools import reduce

import gensim
import jieba
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from gensim.models.word2vec import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def handle_news(stop_words_list: list):
    essays_path = './news_data.csv'
    contents = pd.read_csv(essays_path, encoding='gb18030', usecols=["source", "content"])
    news = []
    labels = []
    count = 0
    for each in contents.iterrows():
        content = str(each[1]['content']).strip()
        source = str(each[1]['source']).strip()
        if content == 'nan':
            continue
        if content is None or not isinstance(content, str):
            continue
        content = handle_doc(content, stop_words_list)
        news.append(content)
        if '新华社' in source:
            labels.append('1')
        else:
            labels.append('0')
        count += 1
        if count % 2000 == 0:
            print('handle docs: ' + str(count))

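    # The with-block closes each file on exit, so no explicit flush/close is needed.
    with open("./news.txt", 'w', encoding='utf-8') as f:
        f.writelines(news)
    # Labels are written as one unbroken string of '0'/'1' characters;
    # load_docs_labels() reads them back character by character.
    with open("./labels.txt", 'w', encoding='utf-8') as f:
        f.writelines(labels)

    # print("Number of articles fetched: " + str(len(essays)))
    # print("Number of Xinhua articles: " + str(count))
    return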


def split_content(content: str, stop_words: list):
    simpled = ''
    s = content.replace("新华社", "")
    s = s.replace("\n", "")  # chain on s so the previous replacement is not discarded
    if s == "":
        return simpled
    segs = jieba.cut(s)
    for seg in segs:
        if seg in stop_words:
            continue
        simpled += seg + " "
    return simpled


def handle_doc(doc: str, stop_words_list: list):
    doc = doc.replace("\n", "。").strip()
    doc = doc.replace(r"\n", "。").strip()
    doc = doc.replace("\r", "。").strip()
    doc = doc.replace("\t", "。").strip()
    doc = doc.replace("新华社", "").strip()
    content = split_content(doc, stop_words_list) + "\n"
    return content


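def get_word_vector(word: str, word_vector_model):
    # A full Word2Vec model keeps its vectors under `.wv`, while a KeyedVectors
    # object is indexed directly, so fall back to the object itself.
    keyed_vectors = getattr(word_vector_model, 'wv', word_vector_model)
    try:
        word_vector = keyed_vectors[word]
    except KeyError:
        # Out-of-vocabulary words contribute a zero vector.
        word_vector = np.zeros(keyed_vectors.vector_size)
    return word_vector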


def load_word_vector_model(path: str, self_trained: bool):
    print("加载的词向量的路径: " + path)
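    # A model converted from GloVe is stored in text form, so it is loaded
    # with load_word2vec_format; a self-trained model is loaded natively.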
    if self_trained:
        word_embedding = gensim.models.Word2Vec.load(path)
    else:
        word_embedding = KeyedVectors.load_word2vec_format(path)
    print('load finished.')
    return word_embedding


def generate_doc_vector(doc: str, word_vec_model: Word2Vec):
    words = doc.split(" ")
    word_vec = np.zeros(word_vec_model.vector_size)
    for word in words:
        word_vec += get_word_vector(word, word_vec_model)
    word_vec = word_vec / len(words)
    return word_vec


def compute_docs_vec(docs: list, model):
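    # np.vstack stacks the per-document vectors into a (n_docs, vector_size) matrix;
    # it replaces the deprecated alias np.row_stack.
    return np.vstack([generate_doc_vector(doc, model) for doc in docs])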


def load_docs_labels(model):
    with open('./news.txt', 'r', encoding='utf-8') as f:
        lines = f.readlines()
        docs = [str(line) for line in lines]
    with open('./labels.txt', 'r', encoding='utf-8') as f:
        lines = f.readlines()
        labels_str = str(lines[0]).strip()
        labels = [int(label) for label in labels_str]

    docs_vec = compute_docs_vec(docs, model)
    labels = np.asarray(labels)
    return docs_vec, labels


###### 2. Secondly, pick a machine learning algorithm that you think is suitable for this task

In [5]:

def train_model(model=None, name=None):
    x, y = load_docs_labels(word_vec_model)
    train_idx, test_idx = train_test_split(range(len(y)), test_size=0.2, stratify=y)
    train_x = x[train_idx, :]
    train_y = y[train_idx]
    test_x = x[test_idx, :]
    test_y = y[test_idx]
    if model is None:
        model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
        name = 'LogisticRegression'
    model.fit(train_x, train_y)
    print("model: " + name)
    print("Training set score: {:.3f}".format(model.score(train_x, train_y)))
    print("Test set score: {:.3f}".format(model.score(test_x, test_y)))
    y_pred = model.predict(test_x)
    # Named `report` to avoid shadowing the built-in eval().
    report = eval_model(test_y, y_pred, np.asarray([0, 1]))
    print(report)
    return model


def predict(doc, model):
    doc_vec = generate_doc_vector(doc, word_vec_model)
    doc_vec = np.asarray(doc_vec).reshape(1, -1)
    y = model['lr'].predict(doc_vec)
    if y[0] == 0:
        return '该新闻非新华社发布'
    else:
        return '该新闻由新华社发布'


# Compute the evaluation metrics
def eval_model(y_true, y_pred, labels):
    # Per-class precision, recall, F1, and support
    p, r, f1, s = precision_recall_fscore_support(y_true, y_pred)
    # Support-weighted averages of precision, recall, and F1, plus total support
    tot_p = np.average(p, weights=s)
    tot_r = np.average(r, weights=s)
    tot_f1 = np.average(f1, weights=s)
    tot_s = np.sum(s)
    res1 = pd.DataFrame({
        u'Label': labels,
        u'Precision': p,
        u'Recall': r,
        u'F1': f1,
        u'Support': s
    })
    res2 = pd.DataFrame({
        u'Label': [u'总体'],
        u'Precision': [tot_p],
        u'Recall': [tot_r],
        u'F1': [tot_f1],
        u'Support': [tot_s]
    })
    res2.index = [999]
    res = pd.concat([res1, res2])
    return res[[u'Label', u'Precision', u'Recall', u'F1', u'Support']]


def save_model(model, output_dir):
    model_file = os.path.join(output_dir, u'model.pkl')
    with open(model_file, 'wb') as outfile:
        pickle.dump({
            'y_encoder': np.asarray([0, 1]),
            'lr': model
        }, outfile)
    return


def load_model(path):
    # Build the path the same way save_model() does.
    with open(os.path.join(path, 'model.pkl'), 'rb') as infile:
        lr_model = pickle.load(infile)
    return lr_model


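# Read the stop-word list; the with-block closes the file afterwards.
with open(u'stopwords.txt', "r", encoding="utf-8") as stop_words_file:
    stop_words_list = [line.strip() for line in stop_words_file]
# Run once to generate news.txt and labels.txt, then leave commented out:
# handle_news(stop_words_list)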
word_vec_model = load_word_vector_model(path='./word_embedding_model_100', self_trained=True)
model = train_model()
save_model(model, './')
with open('./news_demo.txt', 'r', encoding='utf-8') as f:
    # Read the whole article at once; the with-block closes the file.
    doc = f.read()
    print("预测的文章:" + doc)
    doc = handle_doc(doc, stop_words_list)
model = load_model('./')
result = predict(doc, model)
print('预测结果:\n'+result)


加载的词向量的路径: ./word_embedding_model_100
load finished.
预测的文章:网易体育4月16日报道：

4月14日，于汉超在广州涂改车牌，该事件被网友录制后上传网络，随即引发轩然大波。

当晚，恒大发公告宣布开除于汉超，之后德国转会市场网也将于汉超变为自由身。

对于恒大的做法，郝海东再次发话：“郝海东在这里跟许家印说一声，尊重一下劳动法，这个行为够不够解除合同的程度，别把足球运动员都当成工具，给自己留点后路。”
预测结果:
该新闻非新华社发布




### Congratulations! You have completed all assignments for this week. The question below is optional. If you still have time, why not try it out?

## Optional:

#### Try different machine learning algorithms with different combinations of parameters in the practical part, and compare their performance (preferably using some visualization techniques).

In [6]:
def compare_model():
    model_1 = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
    train_model(model_1, name='LogisticRegression')
    print('--------------------------------------')
    model_2 = GaussianNB()
    train_model(model_2, name='GaussianNB')
    return
compare_model()




model: LogisticRegression
Training set score: 0.928
Test set score: 0.928
    Label  Precision    Recall        F1  Support
0       0   0.712690  0.418854  0.527621     1676
1       1   0.940704  0.982015  0.960915    15735
999    总体   0.918755  0.927804  0.919206    17411
--------------------------------------




model: GaussianNB
Training set score: 0.728
Test set score: 0.725
    Label  Precision    Recall        F1  Support
0       0   0.239592  0.855012  0.374298     1676
1       1   0.978740  0.710963  0.823633    15735
999    总体   0.907589  0.724829  0.780380    17411
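
When comparing the two runs, it helps to keep the class imbalance in mind. The sketch below reuses only the per-class support and F1 values printed above to compute two reference numbers: the accuracy of always predicting the majority class, and the unweighted (macro) F1 of each model. Choosing macro F1 as the summary metric is a suggestion, not part of the original evaluation.

```python
# Per-class numbers copied from the evaluation tables above
# (label 1 = published by Xinhua, label 0 = other sources).
support = {0: 1676, 1: 15735}
f1 = {
    "LogisticRegression": {0: 0.527621, 1: 0.960915},
    "GaussianNB": {0: 0.374298, 1: 0.823633},
}

total = sum(support.values())
# Accuracy obtained by always predicting the majority class.
majority_baseline = max(support.values()) / total
# Macro F1 weights both classes equally, regardless of support.
macro_f1 = {name: sum(scores.values()) / len(scores) for name, scores in f1.items()}

print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
for name, value in macro_f1.items():
    print(f"{name}: macro F1 = {value:.3f}")
```

Because about 90% of the test articles carry label 1, the support-weighted averages in the tables are dominated by that class; the macro F1 makes the weaker performance on label 0 easier to see.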
