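# Chapter 8: Applying Machine Learning to Sentiment Analysis

Sentiment analysis is a subfield of natural language processing (NLP).

This chapter covers the following topics:

1. Cleaning and preparing text data
2. Building feature vectors from text documents
3. Training a machine learning model to classify positive and negative movie reviews
4. Working with large text datasets using out-of-core learning
5. Inferring topics from document collections for categorization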

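## 8.1 Preparing the IMDb Movie Review Data for Text Processing

Sentiment analysis, sometimes also called opinion mining, is a subdiscipline of the broader field of NLP that is concerned with analyzing the polarity of documents.

A popular task in sentiment analysis is classifying documents based on the opinions or emotions their authors express about a particular topic.

The movie review dataset consists of 50,000 polar movie reviews, each labeled as either positive or negative:

- Positive: rated with more than six stars
- Negative: rated with fewer than five stars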

### 8.1.1 Obtaining the Movie Review Dataset

In [1]:
# The gzip-compressed tarball can be extracted directly from Python
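
The cell above only notes that the archive can be unpacked from Python. A minimal sketch using the standard-library `tarfile` module might look like the following; the helper name `extract_tarball` is hypothetical, and the notebook does not show which archive file name was actually used.

```python
import tarfile


def extract_tarball(archive_path, dest_dir='.'):
    """Unpack a gzip-compressed tarball into dest_dir and return its member names."""
    # 'r:gz' opens the archive for reading with gzip decompression.
    # Only extract archives from sources you trust.
    with tarfile.open(archive_path, 'r:gz') as tar:
        names = tar.getnames()
        tar.extractall(path=dest_dir)
    return names
```

Calling `extract_tarball(path_to_archive, 'aclImdb_v1')` on the downloaded review archive would, under the assumption that the archive contains a top-level `aclImdb` folder, produce the `aclImdb_v1/aclImdb` layout that `basepath` refers to later.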

### 8.1.2 Preprocessing the Movie Dataset into a More Convenient Format

In [2]:
# Read the movie reviews into a pandas DataFrame object

In [3]:
import pandas as pd
import pyprind
import os

In [4]:
basepath = 'aclImdb_v1/aclImdb'

In [5]:
labels = {'pos': 1, 'neg': 0}

In [6]:
pbar = pyprind.ProgBar(50000)  # initialize the progress bar with 50,000 iterations

In [7]:
df = pd.DataFrame()

In [8]:
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 
                      'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()
# DataFrame.append was removed in pandas 2.0, so the rows are collected
# in a list and the DataFrame is built once at the end
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0% [############################  ] 100% | ETA: 00:00:10

In [9]:
df[:5]

| | review | sentiment |
|---|---|---|
| 0 | Stephen Hawking has one of the greatest minds,... | 1 |
| 1 | This video is a fantastic testament and insigh... | 1 |
| 2 | This movie tells the story of nine ambitious t... | 1 |
| 3 | This series, while idealized and fictionalized... | 1 |
| 4 | Recovery is an incredibly moving piece of work... | 1 |


In [10]:
# Shuffle the DataFrame with the permutation function from the np.random submodule, then save it in CSV format

In [11]:
import numpy as np

In [12]:
np.random.seed(0)

In [13]:
df = df.reindex(np.random.permutation(df.index))

In [14]:
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [15]:
df = pd.read_csv('movie_data.csv', encoding='utf-8')

In [16]:
df.head(3)

| | review | sentiment |
|---|---|---|
| 0 | Just reading why this show got canceled makes ... | 1 |
| 1 | i love this movie. it focuses on both issues: ... | 1 |
| 2 | Although at one point I thought this was going... | 1 |


In [17]:
df[:3]

| | review | sentiment |
|---|---|---|
| 0 | Just reading why this show got canceled makes ... | 1 |
| 1 | i love this movie. it focuses on both issues: ... | 1 |
| 2 | Although at one point I thought this was going... | 1 |


## 8.2 Introducing the Bag-of-Words Model

### 8.2.1 Transforming Words into Feature Vectors

In [18]:
# CountVectorizer as implemented in scikit-learn

In [19]:
import numpy as np

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

In [21]:
count = CountVectorizer()

In [22]:
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'
])

In [23]:
bag = count.fit_transform(docs)

In [24]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [25]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
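
To make the count matrix above less of a black box, here is a small standard-library sketch that rebuilds the same vocabulary and counts by hand. The helper names `simple_tokens` and `count_vector` are made up for this illustration, and the tokenization is a simplification of what `CountVectorizer` does by default.

```python
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]


def simple_tokens(text):
    # Lowercase and drop commas; a simplified stand-in for
    # CountVectorizer's default tokenization.
    return text.lower().replace(',', ' ').split()


# Alphabetically sorted vocabulary: each word gets one column index.
vocabulary = {word: idx for idx, word in
              enumerate(sorted({w for d in docs for w in simple_tokens(d)}))}


def count_vector(text):
    # Count how often each vocabulary word occurs in the document.
    counts = Counter(simple_tokens(text))
    return [counts.get(word, 0) for word in vocabulary]


bag_by_hand = [count_vector(d) for d in docs]
for row in bag_by_hand:
    print(row)
```

Each row is one document and each column one vocabulary word, so the third row records that "is" (column 1) occurs three times in the last sentence.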


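### 8.2.2 Assessing Word Relevancy via Term Frequency-Inverse Document Frequency

In [26]:
# scikit-learn's TfidfTransformer class takes raw term frequencies as input and transforms them into tf-idfs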

In [27]:
from sklearn.feature_extraction.text import TfidfTransformer

In [28]:
tfidf = TfidfTransformer(use_idf=True,
                         norm='l2',
                         smooth_idf=True)
# L2 normalization scales each row (document vector) to unit length

In [29]:
np.set_printoptions(precision=2)

In [30]:
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]
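
The numbers above can be reproduced by hand. The sketch below assumes the smoothed formula documented for scikit-learn, idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1, followed by L2 normalization of each row; the variable and function names are chosen for this illustration only.

```python
import math

# Raw term counts from the CountVectorizer output (rows = documents).
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: the number of documents that contain each term.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]


def tfidf_row(row):
    # Multiply term frequency by idf, then scale the row to unit L2 length.
    raw = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


tfidf_by_hand = [tfidf_row(row) for row in counts]
print([round(v, 2) for v in tfidf_by_hand[2]])
```

Rounded to two decimals, the third row agrees with the last row printed by `TfidfTransformer`: "is" has the highest raw count (3) but appears in every document, so its weight ends up below that of "and" and "one".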


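### 8.2.3 Cleaning Text Data

Removing unwanted characters.

In [31]:
# Use regular expressions (the re module) to remove HTML markup, punctuation, and similar characters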

In [32]:
import re

In [33]:
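def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML markup
    # collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase, collapse non-word characters to spaces, then re-append
    # the emoticons with the '-' nose removed
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text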

In [34]:
# Apply the preprocessor function to every review in the DataFrame
df['review'] = df['review'].apply(preprocessor)

In [35]:
preprocessor('<h1>adjiaw<h1>')

'adjiaw'
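
The test string above contains no emoticons, so it does not exercise the second half of `preprocessor`. The self-contained sketch below repeats the function (with raw-string patterns) and runs it on a made-up input that mixes markup, punctuation, and emoticons.

```python
import re


def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML markup
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text


cleaned = preprocessor('</a>This :) is :( a test :-)!')
print(cleaned)  # this is a test :) :( :)
```

The emoticons are moved to the end of the string and the "nose" character `-` is dropped, so `:-)` and `:)` end up as the same token.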

### 8.2.4 Processing Documents into Tokens

How to split a text corpus into individual elements.

In [36]:
# Split on whitespace into individual words
def tokenizer(text):
    return text.split()

In [37]:
# Word stemming: the Porter stemming algorithm
# The Natural Language Toolkit (NLTK) implements the Porter stemmer
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [38]:
tokenizer_porter('runners like running and thus they run')  # reduce each word to its stem

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [39]:
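# Other stemming algorithms include:
    # the Snowball stemmer
    # the Lancaster stemmer
# Stemming can produce words that do not exist (such as 'thu' above)
# Lemmatization aims to obtain the canonical form (lemma) of each word;
# its effect on performance differs little from that of stemming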

In [40]:
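# Stop-word removal: drop very common words such as is, and, has, and like
# The stop-word list is fetched with the nltk.download function
import nltk
# Download the stop-word corpus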
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ishikawa407/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [41]:
# Once the stop-word corpus has been downloaded
from nltk.corpus import stopwords

In [42]:
stop = stopwords.words('english')  # load the English stop words

In [43]:
stop[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
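
The list is loaded above but not yet applied to any text. A minimal sketch of the filtering step follows; `demo_stop` is a tiny hand-written stand-in for the NLTK list (an assumption so that the example runs without any download), and `remove_stop_words` is a hypothetical helper name.

```python
# Small hand-written stand-in for the NLTK English stop-word list.
demo_stop = {'a', 'and', 'is', 'the', 'they'}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in stop_words."""
    return [w for w in tokens if w not in stop_words]


tokens = 'runners like running and thus they run'.split()
filtered = remove_stop_words(tokens, demo_stop)
print(filtered)  # ['runners', 'like', 'running', 'thus', 'run']
```

With the real list, the same comprehension would be written as `[w for w in tokens if w not in stop]`; converting `stop` to a `set` first makes the membership test faster on large corpora.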

## 8.3 Training a Logistic Regression Model for Document Classification

In [44]:
# Split the cleaned DataFrame into 25,000 training documents and 25,000 test documents

In [167]:
# .loc slicing includes its end label, so :24999 yields exactly 25,000 rows
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

In [168]:
# Use a GridSearchCV object with 5-fold stratified cross-validation to find the best parameter set

In [169]:
from sklearn.model_selection import GridSearchCV  # exhaustive search for the best parameter set
from sklearn.pipeline import Pipeline  # make_pipeline is a shorthand for this
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer  # combines CountVectorizer and TfidfTransformer, so the pipeline can take raw text

In [170]:
tfdif = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

In [171]:
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer,
                                  tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer,
                                  tokenizer_porter],
               'vect__use_idf': [False],
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]
              }]

In [172]:
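# Pipeline takes a list of ('name', transformer) tuples
# solver='liblinear' supports both the 'l1' and 'l2' penalties in the grid
lr_tfdif = Pipeline([('vect', tfdif),
                     ('clf', LogisticRegression(random_state=0,
                                                solver='liblinear'))])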

In [173]:
gs_lr_tfdif = GridSearchCV(lr_tfdif, param_grid,
                           scoring='accuracy',
                           cv=5, verbose=1,
                           n_jobs=1)

In [174]:
gs_lr_tfdif.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
  'stop_words.' % sorted(inconsistent))


KeyboardInterrupt: 

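## 8.4 Working with Bigger Data: Online Algorithms and Out-of-Core Learning

Out-of-core learning handles datasets that are too large for memory by fitting the classifier incrementally on small batches of the data.

This section uses the partial_fit function of scikit-learn's SGDClassifier to stream documents directly from the local drive and train a logistic regression model on small mini-batches of documents.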

In [76]:
# Define a tokenizer function that cleans the raw text from movie_data.csv, splits it into word tokens, and removes stop words

In [77]:
import numpy as np
import re
from nltk.corpus import stopwords
stop = stopwords.words('english')
def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
               ' '.join(emoticons).replace('-', ''))
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

In [78]:
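# Define the generator function stream_docs, which reads in and returns one document at a time
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip the header line
        for line in csv:
            # each line ends with ',<label>\n'; the label is a single digit
            text, label = line[:-3], int(line[-2])
            yield text, label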

In [136]:
sd = stream_docs('movie_data.csv')

In [137]:
# Define get_minibatch, which takes the document stream from stream_docs and returns the number of documents given by the size parameter
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
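
To see how the two helpers behave together, the sketch below repeats them and runs them on a three-row toy file. The file name `toy_reviews.csv` and its rows are invented for this example and are not IMDb data.

```python
import os
import tempfile


def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip the header line
        for line in csv:
            # each line ends with ',<label>\n'; the label is a single digit
            text, label = line[:-3], int(line[-2])
            yield text, label


def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_reviews.csv')
    with open(path, 'w', encoding='utf-8', newline='\n') as f:
        f.write('review,sentiment\n"loved it",1\n"hated it",0\n"fine",1\n')
    stream = stream_docs(path)
    first = get_minibatch(stream, size=2)   # two complete rows
    second = get_minibatch(stream, size=2)  # only one row is left
print(first)
print(second)
```

Two details are worth noticing. The text keeps its surrounding CSV quotes, because the line is sliced rather than parsed. And when the stream runs out before a batch is full, `get_minibatch` returns `(None, None)` and the partial batch is discarded, which is why a training loop should check for `None` before calling `partial_fit`.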

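1. CountVectorizer cannot be used for out-of-core learning because it needs to hold the complete vocabulary in memory.
2. TfidfVectorizer needs to keep all feature vectors of the training set in memory to compute the inverse document frequencies.
3. HashingVectorizer is a text vectorizer that is data-independent, so it suits out-of-core learning.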
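
The data independence in the last point comes from the hashing trick: a token's column index is computed from the token itself, so no vocabulary has to be learned or stored. The following is a rough standard-library sketch of the idea, not scikit-learn's implementation; `hashed_counts` is a hypothetical helper, and `zlib.crc32` is used here only as a deterministic stand-in for the MurmurHash3 function that HashingVectorizer uses.

```python
import zlib


def hashed_counts(tokens, n_features=16):
    """Map tokens to a fixed-length count vector without storing a vocabulary."""
    vec = [0] * n_features
    for tok in tokens:
        # Stand-in hash for this sketch; the column depends only on the token.
        idx = zlib.crc32(tok.encode('utf-8')) % n_features
        vec[idx] += 1
    return vec


v1 = hashed_counts('the sun is shining'.split())
v2 = hashed_counts('the weather is sweet'.split())
print(v1)
print(v2)
```

Because different tokens can land in the same column, a small `n_features` causes collisions; a large value such as `2**21` makes them rare. This sketch also leaves out the sign alternation and normalization that HashingVectorizer applies by default.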

In [157]:
from sklearn.feature_extraction.text import HashingVectorizer
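from sklearn.linear_model import SGDClassifier  # stochastic gradient descent
vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)
# loss='log_loss' selects logistic regression (spelled 'log' before scikit-learn 1.1)
clf = SGDClassifier(loss='log_loss', random_state=1, max_iter=1)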

In [158]:
doc_stream = stream_docs(path='movie_data.csv')

In [159]:
import pyprind

In [160]:
pbar = pyprind.ProgBar(42)
classes = np.array([0, 1])
for _ in range(42):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:26


In [162]:
X_test, y_test = get_minibatch(doc_stream, size=5000)

In [164]:
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.873


## 8.5 Topic Modeling with Latent Dirichlet Allocation

Latent Dirichlet Allocation is abbreviated as LDA below.

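### 8.5.1 Decomposing Text Documents with LDA

Note: this section assumes some familiarity with Bayesian inference.

LDA is a generative probabilistic model that tries to find groups of words that frequently appear together across different documents. Assuming that each document is a mixture of different words, these frequently co-occurring words represent the topics. The input to LDA is the bag-of-words model discussed earlier in this chapter. Given a bag-of-words matrix as input, LDA decomposes it into two new matrices:

1. A document-to-topic matrix
2. A word-to-topic matrix

The number of topics is a hyperparameter of LDA that must be specified manually.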
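
As a rough picture of this decomposition, and not LDA's formal generative definition, the shapes involved can be written as follows. The symbols are introduced here for illustration only: $n$ is the number of documents, $V$ the vocabulary size, and $K$ the chosen number of topics.

$$
\underbrace{X}_{n \times V} \;\approx\; \underbrace{\Theta}_{n \times K} \; \underbrace{B}_{K \times V}
$$

Here $X$ is the bag-of-words matrix, $\Theta$ is the document-to-topic matrix, and $B$ relates topics to words. Only $K$ is chosen by hand; $n$ and $V$ follow from the data and the vectorizer settings.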

### 8.5.2 LDA with scikit-learn

In [175]:
# The LatentDirichletAllocation class

In [176]:
import pandas as pd

In [178]:
df = pd.read_csv('movie_data.csv', encoding='utf-8')

In [180]:
# Use CountVectorizer to create the bag-of-words matrix as input to LDA
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english',
                        max_df=.1,
                        max_features=5000)
X = count.fit_transform(df['review'].values)

In [181]:
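# The LatentDirichletAllocation estimator
from sklearn.decomposition import LatentDirichletAllocation
# n_components sets the number of topics (called n_topics in older scikit-learn releases)
lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')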

In [182]:
X_topics = lda.fit_transform(X)



In [184]:
# components_ holds, for each of the 10 topics (in topic order), the importance of every word (5000 here)
lda.components_.shape

(10, 5000)

In [257]:
# Print the 10 most important words for each of the 10 topics; argsort is ascending, so the order is reversed
n_top_words = 10
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print('Topic %d' % (topic_idx + 1))
    print(" ".join([feature_names[i] 
                    for i in topic.argsort()[::-1][:n_top_words]]))

Topic 1
game music original animation disney children kids series cartoon king
Topic 2
series book tv episode version read dvd original episodes novel
Topic 3
guy girl house killer woman car goes dead killed kill
Topic 4
human beautiful family audience feel war true cinema different documentary
Topic 5
wife woman women role michael plays italian john sex husband
Topic 6
comedy family school fun girl humor jokes kids laugh hilarious
Topic 7
role performance actor john plays played performances job james robert
Topic 8
action war fight effects space fi sci special star earth
Topic 9
horror budget effects low gore special original video blood flick
Topic 10
zone zombies zombie zero youth younger york yesterday yes yellow


In [261]:
horror = X_topics[:, 8].argsort()[::-1]
for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\nHorror movie #%d:' % (iter_idx + 1))
    print(df['review'][movie_idx])


Horror movie #1:
Cut tries to be like most post-Scream slashers tried to be, a spoof of the horror genre that tried to be clever by referencing other famous horror movies. Now, I am not bagging 'Scream,' as I think 'Scream' is a very good horror movie that does a great job of blending horror and comedy. Cut fails on most levels. It has its moments but overall it just does not work out, not even as a "so bad it's good" movie, just a below average one.<br /><br />The first five minutes or so are OK and set the story fairly well, apart from the fact that Kylie Minogue can't really act, and ironically she gets her tongue out, go figure. Go forward some time and a group of film students want to finish her film off, which is apparently cursed. And, as you have probably predicted, one by one the cast and crew are slowly picked off by a masked madman.<br /><br />Unoriginal plot, poor acting and a predictable ending are a few of the elements that follow. There is plenty of referencing in the f