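# Working with Text Data

## Types of Data Represented as Strings

* Types of string data
    * Categorical data
        * Take the values of one field across the dataset and plot a histogram. If the same few values keep recurring, the field is categorical (for example, values chosen from a drop-down menu).
    * Free strings that can be semantically mapped to categories
        * Unlike categorical data, which has a fixed set of spellings, these are free-form answers that still point to one category. For the category "black", people might write "black", "the opposite of white", or "the color of midnight"; all of these map to black (for example, values typed into a text box).
    * Structured string data
        * Strings that follow a fixed underlying structure, such as addresses, dates, or phone numbers.
    * Text data
        * Free-form text made up of phrases or sentences, such as reviews or chat messages.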
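
The histogram check for categorical data can be sketched as follows. The `answers` list and the threshold of five distinct values are made-up assumptions for illustration; a real dataset needs its own judgment.

```python
from collections import Counter

# Hypothetical answers for a "color" field (illustrative data only).
answers = ["red", "black", "red", "blue", "black", "red", "blue", "red"]

# Count how often each distinct value occurs and print a text histogram.
counts = Counter(answers)
for value, n in counts.most_common():
    print(f"{value:<6} {'#' * n}")

# Heuristic (an assumption, not a rule): few distinct values that repeat
# across rows suggest a categorical field.
looks_categorical = len(counts) <= 5 and len(counts) < len(answers)
print("looks categorical:", looks_categorical)
```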

## Example Application: Sentiment Analysis of Movie Reviews

In [18]:
# Imports
from pprint import pprint
import numpy as np
import mglearn

In [25]:
!tree D:\Python\MechineLearing\Python机器学习基础教程(学习笔记)\aclImdb

Folder PATH listing
Volume serial number is 000057BF 20C2:B180
D:\PYTHON\MECHINELEARING\PYTHON机器学习基础教程(学习笔记)\ACLIMDB
├───test
│   ├───neg
│   └───pos
└───train
    ├───neg
    ├───pos
    └───unsup


In [5]:
from sklearn.datasets import load_files
print("It's beginning")
reviews_train = load_files("D:\\Python\\MechineLearing\\Python机器学习基础教程(学习笔记)\\aclImdb\\train\\")
print("Yes")

It's beginning
Yes


In [10]:
text_train, y_train = reviews_train.data, reviews_train.target
print("type of text_train: {}".format(type(text_train)))
print("length of text_train: {}".format(len(text_train)))
pprint("text_train[1]: {}".format(text_train[1]))

type of text_train: <class 'list'>
length of text_train: 75000
('text_train[1]: b"Amount of disappointment I am getting these days seeing '
 'movies like Partner, Jhoom Barabar and now, Heyy Babyy is gonna end my habit '
 'of seeing first day shows.<br /><br />The movie is an utter disappointment '
 'because it had the potential to become a laugh riot only if the '
 "d\\xc3\\xa9butant director, Sajid Khan hadn't tried too many things. Only "
 'saving grace in the movie were the last thirty minutes, which were seriously '
 'funny elsewhere the movie fails miserably. First half was desperately been '
 "tried to look funny but wasn't. Next 45 minutes were emotional and looked "
 'totally artificial and illogical.<br /><br />OK, when you are out for a '
 "movie like this you don't expect much logic but all the flaws tend to appear "
 "when you don't enjoy the movie and thats the case with Heyy Babyy. Acting is "
 'good but thats not enough to keep one interested.<br /><br />For 

In [12]:
# Replace the HTML line breaks with spaces
text_train = [doc.replace(b'<br />', b' ') for doc in text_train]

In [15]:
print("Sample per class (training): {}".format(np.bincount(y_train)))

Sample per class (training): [12500 12500 50000]
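
The third count (50000) comes from the `unsup` folder shown in the directory tree: `load_files` treats every subfolder as a class, so the unlabeled reviews become a third label. One way to keep only the labeled reviews is the `categories` argument of `load_files`. The sketch below builds a tiny throwaway directory with made-up file contents to show the effect; the same argument could be passed when loading `aclImdb\train`.

```python
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import load_files

# Build a tiny stand-in for aclImdb/train with made-up file contents.
root = Path(tempfile.mkdtemp())
samples = {"neg": ["awful", "boring"], "pos": ["great", "lovely"], "unsup": ["unknown"]}
for folder, texts in samples.items():
    (root / folder).mkdir()
    for i, text in enumerate(texts):
        (root / folder / f"{i}.txt").write_text(text, encoding="utf-8")

# Default behaviour: every subfolder becomes a class, including unsup.
all_classes = load_files(str(root))
print(all_classes.target_names, np.bincount(all_classes.target))

# Restrict loading to the labeled folders only.
labelled = load_files(str(root), categories=["neg", "pos"])
print(labelled.target_names, np.bincount(labelled.target))
```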


In [16]:
print("It's beginning")
reviews_test = load_files("D:\\Python\\MechineLearing\\Python机器学习基础教程(学习笔记)\\aclImdb\\test\\")
print("Yes")

It's beginning
Yes


In [17]:
text_test, y_test = reviews_test.data, reviews_test.target
print("Number of documents in test data: {}".format(len(text_test)))
print("Sample per class (test): {}".format(np.bincount(y_test)))
text_test = [doc.replace(b'<br />', b' ') for doc in text_test]

Number of documents in test data: 25000
Sample per class (test): [12500 12500]


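## Representing Text Data as a Bag of Words
* Count how often each word appears in each document.
* Steps:
    1. Tokenization: split each document into words, for example on whitespace and punctuation.
    2. Vocabulary building: collect every word that appears in any document and give each one an index.
    3. Encoding: for each document, count how often each vocabulary word appears in it.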
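
The three steps can be sketched in plain Python. This is a minimal illustration, not what `CountVectorizer` does internally; it assumes lowercasing and a token pattern of two or more word characters, which mirrors the `CountVectorizer` defaults.

```python
import re

docs = ["The fool doth think he is wise",
        "but the wise man knows himself to be a fool"]

# Step 1: tokenization (lowercase, keep runs of 2+ word characters).
TOKEN = re.compile(r"\b\w\w+\b")
tokenized = [TOKEN.findall(doc.lower()) for doc in docs]

# Step 2: vocabulary building (every distinct token, sorted, with an index).
all_tokens = sorted({tok for toks in tokenized for tok in toks})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}

# Step 3: encoding (one count vector per document).
matrix = []
for toks in tokenized:
    row = [0] * len(vocabulary)
    for tok in toks:
        row[vocabulary[tok]] += 1
    matrix.append(row)

print(len(vocabulary))
for row in matrix:
    print(row)
```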

### Applying Bag-of-Words to a Toy Dataset

In [24]:
bards_word = ["The fool doth think he is wise",
              "but the wise man knows himself to be a fool"]

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(bards_word)


CountVectorizer()

In [27]:
print("Vect size: {}".format(len(vect.vocabulary_)))
pprint("Vocabulary content: {}".format(vect.vocabulary_))

Vect size: 13
("Vocabulary content: {'the': 9, 'fool': 3, 'doth': 2, 'think': 10, 'he': 4, "
 "'is': 6, 'wise': 12, 'but': 1, 'man': 8, 'knows': 7, 'himself': 5, 'to': 11, "
 "'be': 0}")


In [30]:
bag_of_words = vect.transform(bards_word)
print("bag_of_words: {}".format(repr(bag_of_words)))

bag_of_words: <2x13 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>


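* The result is kept as a sparse matrix rather than a dense array because a dense array would waste memory. The vocabulary is far larger than the set of words in any single document, so most entries in each row are 0, and storing all those zeros is expensive.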
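
To put rough numbers on this, the sketch below uses the shape and stored-element count reported for `X_train` later in these notes. The byte sizes are assumptions (8-byte values, 4-byte indices), so the CSR figure is an estimate rather than a measurement.

```python
# Shape and stored-element count as reported for X_train in these notes.
n_rows, n_cols, n_stored = 75000, 124255, 10315542

# Fraction of entries that are non-zero.
density = n_stored / (n_rows * n_cols)

# A dense int64 array stores every entry, zeros included (8 bytes each).
dense_bytes = n_rows * n_cols * 8

# Rough CSR estimate, assuming int64 values plus int32 column indices
# and int32 row pointers.
csr_bytes = n_stored * (8 + 4) + (n_rows + 1) * 4

print(f"density: {density:.4%}")
print(f"dense:   {dense_bytes / 1e9:.1f} GB")
print(f"CSR:     {csr_bytes / 1e6:.1f} MB")
```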

In [34]:
print(bag_of_words.toarray())

[[0 0 1 1 1 0 1 0 0 1 1 0 1]
 [1 1 0 1 0 1 0 1 1 1 0 1 1]]


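* Each row corresponds to one document, so there are as many rows as documents. The number of columns equals the vocabulary size and is the same for every document; in other words, the set of words in any document is a subset of the vocabulary.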
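
A consequence of the fixed vocabulary is that words not seen during `fit` have no column and are dropped by `transform`. A small sketch, where the sentence "the wise dragon" is a made-up example:

```python
from sklearn.feature_extraction.text import CountVectorizer

bards_word = ["The fool doth think he is wise",
              "but the wise man knows himself to be a fool"]
vect = CountVectorizer().fit(bards_word)

# "dragon" was never seen during fit, so it has no column and is ignored.
new_row = vect.transform(["the wise dragon"]).toarray()
print(new_row.shape)
print(new_row)
```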

In [37]:
print("It's beginning")
vect = CountVectorizer().fit(text_train)
X_train = vect.transform(text_train)
print("X_train: \n{}".format(repr(X_train)))
print("End")

It's beginning
X_train: 
<75000x124255 sparse matrix of type '<class 'numpy.int64'>'
	with 10315542 stored elements in Compressed Sparse Row format>
End


In [45]:
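# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead.
feature_name = vect.get_feature_names()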
print("The number of features: {}".format(len(feature_name)))
pprint("First 20 features: {}".format(feature_name[:20]))
pprint("Every 2000th feature: \n{}".format(feature_name[::2000]))

The number of features: 124255
("First 20 features: ['00', '000', '0000', "
 "'0000000000000000000000000000000001', '0000000000001', '000000001', "
 "'000000003', '00000001', '000001745', '00001', '0001', '00015', '0002', "
 "'0007', '00083', '000ft', '000s', '000th', '001', '002']")
('Every 2000th feature: \n'
 "['00', '_require_', 'aideed', 'announcement', 'asteroid', 'banquière', "
 "'besieged', 'bollwood', 'btvs', 'carboni', 'chcialbym', 'clotheth', "
 "'consecration', 'cringeful', 'deadness', 'devagan', 'doberman', 'duvall', "
 "'endocrine', 'existent', 'fetiches', 'formatted', 'garard', 'godlie', "
 "'gumshoe', 'heathen', 'honoré', 'immatured', 'interested', 'jewelry', "
 "'kerchner', 'köln', 'leydon', 'lulu', 'mardjono', 'meistersinger', "
 "'misspells', 'mumblecore', 'ngah', 'oedpius', 'overwhelmingly', 'penned', "
 "'pleading', 'previlage', 'quashed', 'recreating', 'reverent', 'ruediger', "
 "'sceme', 'settling', 'silveira', 'soderberghian', 'stagestruck', 'subprime', "
 "'tab

In [68]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

print("It's beginning")
logreg = LogisticRegression()
scores = cross_val_score(logreg, X_train, y_train, cv=5)
print("Maybe it's already done!")
print(scores)

It's beginning
Maybe it's already done!
[0.66386667 0.66266667 0.6682     0.6736     0.68073333]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
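
The warning above means the solver stopped at its iteration limit before converging. One of the remedies it names is raising `max_iter`. The sketch below shows that setting on a tiny made-up corpus (the texts and labels are hypothetical), not on the IMDb data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Made-up mini corpus: 1 = positive, 0 = negative.
texts = ["great movie", "loved this film", "wonderful acting", "really great fun",
         "terrible movie", "hated this film", "awful acting", "really boring mess"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = CountVectorizer().fit_transform(texts)

# max_iter raised from the default of 100, as the warning suggests.
logreg = LogisticRegression(max_iter=1000)
scores = cross_val_score(logreg, X, labels, cv=2)
print(scores)
```
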

In [69]:

from sklearn.model_selection import GridSearchCV

print("It's beginning!")
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(logreg, param_grid, cv=5)
grid.fit(X_train, y_train)
print("End")

It's beginning!
End


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [70]:
print("Best parameters: {}".format(grid.best_params_))
print("Best score: {}".format(grid.best_score_))

Best parameters: {'C': 0.01}
Best score: 0.6806133333333333
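
Besides `best_params_` and `best_score_`, the fitted search keeps the mean cross-validation score for every candidate in `cv_results_`, which helps show how sensitive the model is to `C`. A minimal sketch on a made-up corpus (texts, labels, and the grid values are illustrative assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Made-up mini corpus: 1 = positive, 0 = negative.
texts = ["great movie", "loved this film", "wonderful acting", "really great fun",
         "terrible movie", "hated this film", "awful acting", "really boring mess"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
X = CountVectorizer().fit_transform(texts)

param_grid = {"C": [0.01, 1, 100]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=2)
grid.fit(X, labels)

# One mean cross-validation score per candidate value of C.
for c, score in zip(grid.cv_results_["param_C"], grid.cv_results_["mean_test_score"]):
    print(f"C={c}: {score:.3f}")
print("best:", grid.best_params_)
```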


In [72]:
print("It's beginning")
logreg = LogisticRegression(C=0.01, max_iter=1000)
logreg.fit(X_train, y_train)
# Accuracy on the training data itself, not a held-out estimate
print(logreg.score(X_train, y_train))

It's beginning
0.7710266666666666
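
The score above is measured on the data the model was fitted on. To estimate generalization, the test documents would be transformed with the same fitted vectorizer and scored separately. The sketch below shows the pattern on made-up texts and labels. With the data as loaded in these notes, the training labels include the `unsup` class while the test labels do not, so the two label sets would need to match first.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up mini corpora: 1 = positive, 0 = negative.
train_texts = ["great movie", "loved this film", "terrible movie", "hated this film"]
train_labels = [1, 1, 0, 0]
test_texts = ["great film", "terrible film"]
test_labels = [1, 0]

# Fit the vocabulary on the training texts only, then reuse it for the test texts.
vect = CountVectorizer().fit(train_texts)
X_train = vect.transform(train_texts)
X_test = vect.transform(test_texts)

logreg = LogisticRegression(C=0.01, max_iter=1000).fit(X_train, train_labels)
train_acc = logreg.score(X_train, train_labels)
test_acc = logreg.score(X_test, test_labels)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```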
