## Assignment-07: First Steps with Machine Learning Models

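### Task Description

News agencies and similar organizations often face the same problem: other outlets reuse their articles without crediting the source. In this assignment we address the problem of Xinhua News Agency (新华社) articles being copied without attribution.

The provided dataset contains news text that originates from Xinhua News Agency but whose listed source is not Xinhua. Design a machine learning model that detects such articles.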

### Step1: Data Analysis

Download the dataset from the course GitHub repository, then read it with pandas.

In [1]:
new_file = '/Users/qiujiafa/NLP_lessons_online/data/news_chinese_dumpload.csv'

In [2]:
import pandas as pd
import numpy as np

In [3]:
news_df = pd.read_csv(new_file)

In [4]:
len(news_df)
# 89611 rows originally; 87054 rows after dropping rows with empty content

89611

In [5]:
# 2557 rows have empty news content; these rows need to be dropped
len(news_df[news_df['content'].isnull()])
news_df.dropna(subset=['content'], inplace=True)

In [6]:
news_df.head()

Unnamed: 0.1,Unnamed: 0,id,author,source,content,feature,title,url
0,0,1,王子江 张寒,新华社,新华社德国杜塞尔多夫６月６日电题：乒乓女球迷　\n 新华社记者王子江、张寒\n 熊老...,"{""type"":""体育"",""site"":""新华社"",""url"":""http://home.x...",（体育）题：乒乓女球迷,http://home.xinhua-news.com/gdsd
1,1,2,夏文辉,新华社,\n\n2017年5月25日，在美国马萨诸塞州剑桥市，哈佛大学毕业生在毕业典礼上欢呼。（新华...,"{""type"":""其它"",""site"":""新华社"",""url"":""http://home.x...",哈佛大学为何取消这些新生入选资格？,http://home.xinhua-news.com/gdsd
2,2,3,张旌,新华社,\n\n2017年5月29日，在法国巴黎郊外的凡尔赛宫，法国总统马克龙出席新闻发布会。（新华...,"{""type"":""其它"",""site"":""新华社"",""url"":""http://home.x...",法国议会选举　马克龙有望获“压倒性胜利”,http://home.xinhua-news.com/gdsd
3,3,4,王衡,新华社,新华社兰州6月3日电（王衡、徐丹）记者从甘肃省交通运输厅获悉，甘肃近日集中开建高速公路、普通...,"{""type"":""宏观经济"",""site"":""新华社"",""url"":""http://home...",（经济）甘肃集中开工35个重点交通建设项目,http://home.xinhua-news.com/gdsd
4,4,5,邹峥,新华社,新华社照片，多伦多，2017年6月7日\n（体育）（2）冰球——国家女子冰球队海外选秀在多伦...,"{""type"":""冰球"",""site"":""新华社"",""url"":""http://home.x...",（体育）（2）冰球——国家女子冰球队海外选秀在多伦多展开,http://home.xinhua-news.com/gdsd


In [6]:
news_df.drop(columns=['Unnamed: 0'], inplace=True)

In [7]:
news_df.columns

Index(['id', 'author', 'source', 'content', 'feature', 'title', 'url'], dtype='object')

In [8]:
xinhua_news = news_df[news_df['source'] == '新华社']

In [9]:
# Share of articles whose source is Xinhua News Agency (新华社)
len(xinhua_news) / len(news_df)

0.9035885772049532
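
Since roughly 90% of the articles are labelled 新华社, accuracy alone is a weak yardstick: a classifier that always predicts the majority class would already score about 0.90. The sketch below makes that baseline explicit using the ratio printed above; the helper name `majority_baseline_accuracy` is hypothetical and not part of the original notebook.

```python
def majority_baseline_accuracy(positive_ratio):
    """Accuracy obtained by always predicting the more frequent class."""
    return max(positive_ratio, 1 - positive_ratio)

# Ratio of Xinhua articles as printed in the cell above.
xinhua_ratio = 0.9035885772049532
print(majority_baseline_accuracy(xinhua_ratio))
```

Any model evaluated later should be compared against this figure rather than against 0.5.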

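### Step2: Data Preprocessing

Based on whether each article comes from Xinhua News Agency, convert the pandas data into a new DataFrame of the form <content, y>, where content is the article text and y is 0 or 1. You may need pandas DataFrame operations such as https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html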
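
As a minimal sketch of the target <content, y> layout, the label can also be derived with a vectorized comparison instead of `apply`. The toy rows below are made up for illustration and only stand in for the real `news_df`.

```python
import pandas as pd

# Toy stand-in for news_df; the rows are invented for illustration.
toy_df = pd.DataFrame({
    'source': ['新华社', '中国新闻网', '新华社'],
    'content': ['text a', 'text b', 'text c'],
})

# Comparing the column yields booleans; astype(int) maps True/False to 1/0.
labeled = pd.DataFrame({
    'content': toy_df['content'],
    'y': (toy_df['source'] == '新华社').astype(int),
})
print(labeled)
```

Both approaches produce the same labels; the vectorized form tends to be faster on large frames.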

In [10]:
import jieba
import re

In [11]:
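def cut_content(content):
    """Segment article text into space-separated Chinese words."""
    # Keep only word characters (drops punctuation and whitespace).
    # Note: literal "\n" escape sequences in the data lose their backslash here,
    # which leaves a stray "n" token in the output.
    content = ''.join(re.findall(r'\w+', content))
    return ' '.join(jieba.cut(content))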
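
The stray `n` tokens in the segmented output suggest that the raw text contains literal two-character `\n` sequences (a backslash followed by `n`). One possible pre-cleaning step, shown without jieba so it runs on its own, is sketched here; `strip_escaped_newlines` and `keep_word_chars` are hypothetical helper names.

```python
import re

def strip_escaped_newlines(text):
    """Replace literal backslash-n sequences and real newlines with a space."""
    return re.sub(r'\\n|\n', ' ', text)

def keep_word_chars(text):
    """Keep only runs of word characters, joined by single spaces."""
    return ' '.join(re.findall(r'\w+', text))

# Raw string: it contains a backslash and the letter "n", not a newline.
raw = r'乒乓女球迷\n 新华社记者'
print(keep_word_chars(raw))                          # the stray "n" survives
print(keep_word_chars(strip_escaped_newlines(raw)))  # the stray "n" is gone
```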
    
  

In [12]:
# Take an explicit copy so that later column assignments do not trigger SettingWithCopyWarning
all_news = news_df[['source', 'content']].copy()

In [13]:
all_news['mark'] = all_news.apply(lambda x: 1 if x['source'] == '新华社' else 0, axis=1)

In [14]:
all_news['content'] = all_news['content'].map(cut_content)

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/sb/6yx77d415rgf3770g8nhntfc0000gn/T/jieba.cache
Loading model cost 0.648 seconds.
Prefix dict has been built succesfully.

In [16]:
all_news.head()

Unnamed: 0,source,content,mark
0,新华社,新华社 德国 杜塞尔多夫 ６ 月 ６ 日电 题 乒乓 女球迷 n 新华社 记者 王子江 张寒...,1
1,新华社,nn2017 年 5 月 25 日 在 美国 马萨诸塞州 剑桥市 哈佛大学 毕业生 在 毕业...,1
2,新华社,nn2017 年 5 月 29 日 在 法国巴黎 郊外 的 凡尔赛宫 法国 总统 马克 龙 ...,1
3,新华社,新华社 兰州 6 月 3 日电 王衡 徐丹 记者 从 甘肃省 交通运输 厅 获悉 甘肃 近日...,1
4,新华社,新华社 照片 多伦多 2017 年 6 月 7 日 n 体育 2 冰球 国家 女子 冰球队 ...,1


In [17]:
all_news.drop(columns='source', inplace=True)


### Step3: Vectorize the Text with TF-IDF

Vectorize the text, referring to https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
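
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF with smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each row, which is the formula documented for scikit-learn's default settings. It is an illustration under that assumption, not the library's implementation, and it splits on whitespace only. Note that `TfidfVectorizer`'s default `token_pattern` keeps only tokens of two or more characters, so single-character words are dropped there but kept in this sketch.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Made-up, already segmented documents.
docs = ['新华社 记者 报道', '记者 采访 球迷', '新华社 照片']
for row in tfidf(docs):
    print(row)
```

Terms that appear in fewer documents receive a larger weight than terms that appear everywhere.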

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [19]:
from time import time

In [22]:
TfidfVectorizer?

In [20]:
vectorizer = TfidfVectorizer(max_features=200)

In [21]:
contents = all_news['content']

In [22]:
marks = all_news['mark']

In [23]:
len(contents)

87054

In [24]:
t0 = time()
Contents_X = vectorizer.fit_transform(contents)
duration = time() - t0
print(f"Done convert news contents into tfidf matrix in {duration}s")

Done convert news contents into tfidf matrix in 9.022587776184082s


In [25]:
Contents_X.shape

(87054, 200)

### Step4: Split into Training and Test Sets

In [26]:
from sklearn.model_selection import train_test_split

In [27]:
train_contents, test_contents, train_mark, test_mark = train_test_split(Contents_X, marks, test_size=0.2)

In [24]:
train_test_split?

In [28]:
train_contents.shape

(69643, 200)

In [29]:
test_contents.shape

(17411, 200)

In [30]:
print(len(train_mark))
print(len(test_mark))

69643
17411
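
In the cells above, the TF-IDF vectorizer is fitted on all 87054 documents before the split, so document-frequency statistics from the test articles influence the training features. One possible alternative, sketched on made-up toy data, is to split the raw text first and wrap the vectorizer and classifier in a pipeline so the vectorizer only sees the training part. The `stratify` and `random_state` arguments are additions for reproducibility and class balance, not settings from the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up, already space-segmented documents and labels.
texts = ['新华社 记者 报道', '新华社 照片 体育', '新华社 北京 电',
         '新华社 记者 采访', '本报 讯 市场', '本报 记者 评论',
         '本报 讯 财经', '本报 综合 消息']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# Fitting the pipeline fits the vectorizer on the training text only.
model = make_pipeline(TfidfVectorizer(max_features=200), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```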


## Step 5: Apply the models introduced in Lesson 8 and observe how different parameters and models affect performance.

### 1. Logistic Regression

In [32]:
from sklearn.linear_model import LogisticRegression

In [42]:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

In [34]:
LogisticRegression?

In [39]:
t0 = time()
lr_module = LogisticRegression(solver='saga').fit(train_contents, train_mark)
duration = time() - t0
print(f"Done in {duration}s")

Done in 0.467195987701416s


In [40]:
lr_module.score(test_contents, test_mark)

0.9811038998334386

In [41]:
test_mark_predict = lr_module.predict(test_contents)

In [43]:
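def get_module_score(result, predict_result):
    """Print precision, recall, F1 and ROC AUC for true labels vs. predicted labels."""
    print(f"precision score: {precision_score(result, predict_result)}")
    print(f"recall score: {recall_score(result, predict_result)}")
    print(f"f1_score: {f1_score(result, predict_result)}")
    # Note: ROC AUC is computed here from hard 0/1 predictions, not from predicted probabilities.
    print(f"roc_auc_score: {roc_auc_score(result, predict_result)}")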

In [44]:
get_module_score(test_mark, test_mark_predict)

precision score: 0.9865137383413158
recall score: 0.9927072103494198
f1_score: 0.9896007838922781
roc_auc_score: 0.9311891715571703
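
To see where these numbers come from, the sketch below derives the same kinds of metrics from confusion-matrix counts on a tiny made-up example. When ROC AUC is computed from hard 0/1 predictions, as in `get_module_score`, it reduces to the mean of recall and specificity (balanced accuracy), which is why it sits well below the other scores when the minority class is often misclassified. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and balanced accuracy for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    return {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        # Equals ROC AUC when the scores are hard 0/1 predictions.
        'balanced_accuracy': (recall + specificity) / 2,
    }

# Made-up labels: 4 positives, 2 negatives.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]
print(binary_metrics(y_true, y_pred))
```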


### 2. Naive Bayes

In [45]:
from sklearn.naive_bayes import GaussianNB

In [52]:
GaussianNB?

In [50]:
nb = GaussianNB()

In [59]:
train_contents.shape


(69643, 200)

In [60]:
nb.fit(train_contents.toarray(), train_mark)

GaussianNB(priors=None, var_smoothing=1e-09)

In [62]:
test_mark_predict1 = nb.predict(test_contents.toarray())

In [63]:
get_module_score(test_mark, test_mark_predict1)

precision score: 0.9944404157602127
recall score: 0.7826748684127085
f1_score: 0.8759403832505324
roc_auc_score: 0.8703264719651849
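
`GaussianNB` needs a dense array, hence the `.toarray()` calls above. A different variant, `MultinomialNB`, accepts sparse non-negative features directly and is commonly paired with word-count or TF-IDF features, so it may be worth comparing. The sketch below uses made-up toy documents and is a swapped-in technique, not what the notebook ran.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up, already space-segmented documents and labels.
texts = ['新华社 记者 报道', '新华社 照片 体育', '新华社 记者 采访',
         '本报 记者 评论', '本报 综合 消息', '本报 财经 市场']
labels = [1, 1, 1, 0, 0, 0]

X = TfidfVectorizer().fit_transform(texts)   # sparse matrix
clf = MultinomialNB().fit(X, labels)         # no .toarray() needed
predictions = clf.predict(X)
print(predictions)
```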


### 3. SVM (Support Vector Machine)

In [64]:
from sklearn.svm import SVC, LinearSVC

In [65]:
SVC?

In [70]:
svc = SVC(probability=True, gamma='scale')

In [71]:
svc.fit(train_contents, train_mark)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [72]:
test_mark_predict2 = svc.predict(test_contents)

In [73]:
get_module_score(test_mark, test_mark_predict2)

precision score: 0.981431528762311
recall score: 0.9921364702898091
f1_score: 0.9867549668874172
roc_auc_score: 0.9059342521972796
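
`LinearSVC` is imported above but never used. For sparse, high-dimensional text features a linear SVM is often much faster to train than an RBF-kernel `SVC`, so it is a natural candidate for the parameter and model comparison this step asks for. A minimal sketch on made-up toy documents (the value `C=1.0` is simply the default, stated for visibility):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up, already space-segmented documents and labels.
texts = ['新华社 记者 报道', '新华社 照片 体育', '新华社 记者 采访',
         '本报 记者 评论', '本报 综合 消息', '本报 财经 市场']
labels = [1, 1, 1, 0, 0, 0]

X = TfidfVectorizer().fit_transform(texts)
linear_svc = LinearSVC(C=1.0).fit(X, labels)
svc_predictions = linear_svc.predict(X)
print(svc_predictions)
```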


### 4. Decision Tree

In [74]:
from sklearn.tree import DecisionTreeClassifier

In [75]:
decision_tree = DecisionTreeClassifier()
decision_tree.fit(train_contents, train_mark)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [76]:
test_mark_predict3 = decision_tree.predict(test_contents)

In [77]:
get_module_score(test_mark, test_mark_predict3)

precision score: 0.9932711229607059
recall score: 0.9922633014141671
f1_score: 0.9927669564113952
roc_auc_score: 0.9638539405974611
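
To observe how a single parameter affects performance, one option is to sweep it and record a metric for each value. The sketch below varies `max_depth` on a synthetic, imbalanced dataset from `make_classification`; the dataset, the depth values, and the fixed `random_state` are illustrative assumptions and unrelated to the news data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with about 90% positives, loosely mimicking the class imbalance.
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.1, 0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

scores = {}
for depth in (1, 3, 5, None):   # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    scores[depth] = f1_score(y_te, tree.predict(X_te))
    print(depth, scores[depth])
```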


### 5. Random Forest

In [78]:
from sklearn.ensemble import RandomForestClassifier

In [79]:
RandomForestClassifier?

In [82]:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(train_contents, train_mark)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [83]:
test_mark_predict4 = random_forest.predict(test_contents)

In [84]:
get_module_score(test_mark, test_mark_predict4)

precision score: 0.9949968334388853
recall score: 0.9963218973936204
f1_score: 0.99565892455401
roc_auc_score: 0.9741049194641671
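
Returning to the original task, the articles of interest are those whose listed source is not 新华社 (mark 0) but which the model predicts as Xinhua (1). In metric terms these are the false positives, and they are the candidates for unattributed reuse. A minimal sketch of that selection on made-up labels; `suspected_copies` is a hypothetical helper name.

```python
def suspected_copies(true_marks, predicted_marks):
    """Indices of articles labelled non-Xinhua (0) but predicted as Xinhua (1)."""
    return [i for i, (t, p) in enumerate(zip(true_marks, predicted_marks))
            if t == 0 and p == 1]

# Made-up labels for five articles.
true_marks = [1, 0, 0, 1, 0]
predicted_marks = [1, 1, 0, 1, 1]
print(suspected_copies(true_marks, predicted_marks))  # → [1, 4]
```

With the real data, the same idea could be applied to `test_mark` and any of the `test_mark_predict*` arrays, after which the flagged articles would still need manual review.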
