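Learn how to load and clean the IMDB movie review data, then apply a simple Bag of Words model to predict whether a review is positive (thumbs up) or negative (thumbs down).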

In [3]:
! unzip /kaggle/input/word2vec-nlp-tutorial/labeledTrainData.tsv.zip
! unzip /kaggle/input/word2vec-nlp-tutorial/testData.tsv.zip
! unzip /kaggle/input/word2vec-nlp-tutorial/unlabeledTrainData.tsv.zip

Archive:  /kaggle/input/word2vec-nlp-tutorial/labeledTrainData.tsv.zip
  inflating: labeledTrainData.tsv    
Archive:  /kaggle/input/word2vec-nlp-tutorial/testData.tsv.zip
  inflating: testData.tsv            
Archive:  /kaggle/input/word2vec-nlp-tutorial/unlabeledTrainData.tsv.zip
  inflating: unlabeledTrainData.tsv  


Use pandas to read labeledTrainData, which contains 25,000 IMDB movie reviews, each with a positive or negative sentiment label.

In [5]:
import pandas as pd
train = pd.read_csv('/kaggle/working/labeledTrainData.tsv', header=0,
                    delimiter='\t', quoting=3)

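header=0 indicates that the first line of the file contains the column names, delimiter='\t' indicates that the fields are separated by tabs, and quoting=3 tells the parser to treat double quotes as ordinary characters (3 is the value of csv.QUOTE_NONE); without it, reading the file may raise errors.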
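As a small illustration of these parameters, the sketch below reads a made-up one-row TSV sample (not taken from the competition files) from memory. It assumes only that `quoting=3` and `csv.QUOTE_NONE` are the same setting, which the first print confirms.

```python
import csv
import io

import pandas as pd

# Hypothetical one-row TSV sample, invented for this sketch.
sample = 'id\tsentiment\treview\n"1_1"\t1\t"A "great" film"\n'

# quoting=csv.QUOTE_NONE is the named form of quoting=3:
# double quotes are kept as ordinary characters.
df = pd.read_csv(io.StringIO(sample), header=0, delimiter="\t",
                 quoting=csv.QUOTE_NONE)

print(csv.QUOTE_NONE)    # the constant behind quoting=3
print(df.shape)          # rows and columns of the sample
print(df["review"][0])   # the quotes are still part of the text
```

This is also why the id and review values shown later in this notebook still carry their surrounding double quotes.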

Check that we have 25,000 rows and 3 columns:

In [6]:
train.shape

(25000, 3)

Get the column names of train, returned as a NumPy array.

In [7]:
train.columns.values

array(['id', 'sentiment', 'review'], dtype=object)

Show the first five rows of train.

In [8]:
train.head()

         id  sentiment                                             review
0  "5814_8"          1  "With all this stuff going down at the moment ...
1  "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"          0  "The film starts with a manager (Nicholas Bell...
3  "3630_4"          0  "It must be assumed that those who praised thi...
4  "9495_8"          1  "Superbly trashy and wondrously unpretentious ...


Get the first element of the review column, i.e. the text of the first review:

In [9]:
train['review'][0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

**Data Cleaning and Text Processing**

Use BeautifulSoup to strip the HTML tags:

In [10]:
from bs4 import BeautifulSoup

In [11]:
# Initialize a BeautifulSoup object on a single review
example1 = BeautifulSoup(train['review'][0], 'lxml')

In [12]:
# Compare the original text with the processed text; get_text() returns the processed result
print(train['review'][0])
print()
print(example1.get_text())

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

The result no longer contains tags. Punctuation, numbers, and stop words can be handled with NLTK and regular expressions.
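A minimal sketch of the tag removal on a short fragment of the first review. It uses Python's built-in "html.parser" instead of "lxml" so that it runs without the extra dependency; that substitution is an assumption of this sketch, not what the notebook does.

```python
from bs4 import BeautifulSoup

snippet = "drugs are bad m'kay.<br /><br />Visually impressive"

# "html.parser" ships with Python; the notebook itself uses "lxml".
soup = BeautifulSoup(snippet, "html.parser")

joined = soup.get_text()               # tags dropped, neighbouring text touches
spaced = soup.get_text(separator=" ")  # put a space between the text pieces

print(joined)
print(spaced)
```

With the default call the two sentences end up adjacent with no space between them. In this notebook that is harmless, because the punctuation is later replaced by spaces, but the `separator` argument is available if the text is used without that step.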

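Punctuation is usually removed outright, but the right choice depends on the problem. Here we are classifying the sentiment of reviews, so symbols such as "!!!" or ":-(" carry emotion and arguably should be kept. For simplicity we remove them here, but you can experiment with other approaches.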
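If you do want to experiment with keeping emotive punctuation, one possible starting point is to extract those tokens before the punctuation is stripped. The pattern and the helper name `emotive_tokens` below are assumptions made for this sketch and cover only a few common cases.

```python
import re

# Assumed, deliberately small pattern: runs of "!" and a few common emoticons.
EMOTIVE = re.compile(r"!{2,}|[:;=]-?[()DP]")

def emotive_tokens(text):
    """Return the emotive punctuation tokens found in text, in order."""
    return EMOTIVE.findall(text)

print(emotive_tokens("Awful!!! :-( but the ending was fine :)"))
```

The extracted tokens could then be appended to the cleaned word list as extra features.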

Likewise, we remove numbers; a better approach would be to replace every number with a placeholder token such as NUM.
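The NUM idea can be sketched as follows. The helper name `replace_numbers` is hypothetical, and the pattern only handles plain integers and simple decimals.

```python
import re

def replace_numbers(text):
    """Replace each run of digits (optionally with a decimal part) by NUM."""
    return re.sub(r"\d+(?:\.\d+)?", "NUM", text)

print(replace_numbers("Released in 1988, I rate it 7.5 out of 10."))
```

Because NUM consists only of letters, it would survive the letters-only step used in this notebook (and become "num" after lowercasing).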

Next, use a regular expression to handle punctuation and numbers:

In [13]:
import re

In [14]:
letters_only = re.sub('[^a-zA-Z]', # The pattern to search for
                      ' ',         # The pattern to replace it with
                      example1.get_text()) # The text to search

In [15]:
letters_only

' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    m

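What the regular expression means: [] denotes a character set, and ^ at the start of the set means "not". In other words, re.sub() finds every character that is neither a lowercase letter a-z nor an uppercase letter A-Z and replaces it with a space, so punctuation and digits in the text become spaces.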
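The same substitution on a short made-up string shows the effect more compactly. Each non-letter character becomes exactly one space, so the length of the string does not change.

```python
import re

text = "It's 10/10!!"
letters_only = re.sub("[^a-zA-Z]", " ", text)

print(repr(letters_only))    # repr() makes the run of spaces visible
print(letters_only.split())  # split() discards the extra whitespace
```

Note that the apostrophe is replaced as well, which is why fragments such as "s" and "ve" appear in the cleaned review.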

Then convert everything to lowercase and split the text into individual words (tokenization):

In [17]:
lower_case = letters_only.lower() # Convert to lower case
words = lower_case.split() # Split into words

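Finally, we need to deal with words that occur frequently but carry little meaning, known as stop words. In English, words such as a, and, is, and the are stop words. We can import a stop word list from NLTK: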

In [18]:
from nltk.corpus import stopwords

In [19]:
stopwords.words('english')[:20]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his']

Remove the stop words from the review:

In [20]:
words = [w for w in words if not w in stopwords.words('english')]
words[:20]

['stuff',
 'going',
 'moment',
 'mj',
 'started',
 'listening',
 'music',
 'watching',
 'odd',
 'documentary',
 'watched',
 'wiz',
 'watched',
 'moonwalker',
 'maybe',
 'want',
 'get',
 'certain',
 'insight',
 'guy']

Now combine all of the steps above into a single function:

In [24]:
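def review_to_words(raw_review):
    # Convert a raw movie review into a string of words.
    # The input is a single string (a raw movie review) and the output is
    # a single string (a preprocessed movie review).
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, "lxml").get_text()
    #
    # 2. Remove non-letter characters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    #
    # 3. Convert to lowercase and split into individual words
    words = letters_only.lower().split()
    #
    # 4. In Python, searching a set is much faster than searching a list,
    #    so convert the stop words to a set
    stops = set(stopwords.words("english"))
    #
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one space-separated string and return it
    return " ".join(meaningful_words)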

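Two things are different here. First, stops is converted to a set for speed, because membership tests on a set are faster than on a list. Second, the words are joined back into a single string at the end, so the output can be fed directly into the Bag of Words step later.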
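A rough sketch of the set-versus-list point. The ten-word stop list is a made-up miniature (the notebook uses NLTK's English list), and the timings depend on the machine, so no expected numbers are shown. With a list this short the gap may be small; it grows with the length of the stop word list.

```python
import timeit

# Hypothetical miniature stop word list; the notebook uses NLTK's English list.
stop_list = ["i", "me", "my", "the", "a", "and", "is", "this", "with", "at"]
stop_set = set(stop_list)

words = "with all this stuff going down at the moment".split()

kept_list = [w for w in words if w not in stop_list]  # linear scan per word
kept_set = [w for w in words if w not in stop_set]    # hash lookup per word
print(kept_set)

# Both filters give the same words; only the lookup cost differs.
print(timeit.timeit(lambda: [w for w in words if w not in stop_list], number=10000))
print(timeit.timeit(lambda: [w for w in words if w not in stop_set], number=10000))
```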

After that, we only need to call the function:

In [22]:
clean_review = review_to_words(train['review'][0])
clean_review

'stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate workin

Use a loop to clean every review in the training set:

In [25]:
# Number of reviews
num_reviews = train['review'].size

# Initialize an empty list to hold the cleaned reviews
clean_train_reviews = []

# Loop over each review
for i in range(0, num_reviews):
    # Call the function on each review and append the result to the list
    clean_train_reviews.append( review_to_words( train['review'][i]))



Preprocess each review with review_to_words and store the results in the list clean_train_reviews. This time, print a message after every 1,000 reviews to show progress:

In [26]:
print("Cleaning and parsing the training set movie reviews...\n")
clean_train_reviews = []
for i in range( 0, num_reviews ):
    # Print a message whenever the review count (i+1) is divisible by 1000
    if( (i+1)%1000 == 0 ):
        print("Review %d of %d\n" % (i+1, num_reviews))                                                                  
    clean_train_reviews.append( review_to_words( train["review"][i] ))

Cleaning and parsing the training set movie reviews...





Review 1000 of 25000

Review 2000 of 25000

Review 3000 of 25000

Review 4000 of 25000

Review 5000 of 25000

Review 6000 of 25000

Review 7000 of 25000

Review 8000 of 25000

Review 9000 of 25000

Review 10000 of 25000

Review 11000 of 25000

Review 12000 of 25000

Review 13000 of 25000

Review 14000 of 25000

Review 15000 of 25000

Review 16000 of 25000

Review 17000 of 25000

Review 18000 of 25000

Review 19000 of 25000

Review 20000 of 25000

Review 21000 of 25000

Review 22000 of 25000

Review 23000 of 25000

Review 24000 of 25000

Review 25000 of 25000
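The explicit index loop above can also be written with pandas' `Series.apply`. In the sketch below, `simple_clean` is a hypothetical stand-in for review_to_words (it skips the HTML and stop word steps so that the example runs without BeautifulSoup or NLTK), and the two-row frame is made up.

```python
import re

import pandas as pd

def simple_clean(raw_review):
    """Hypothetical stand-in for review_to_words: letters only, lower-cased."""
    return " ".join(re.sub("[^a-zA-Z]", " ", raw_review).lower().split())

# Made-up miniature frame with the same "review" column name as train.
frame = pd.DataFrame({"review": ["Great film!!", "Boring... 2/10"]})

# apply() calls the function on every review; tolist() gives a plain list.
clean_reviews = frame["review"].apply(simple_clean).tolist()
print(clean_reviews)
```

This form gives no progress messages, which is one reason to prefer the explicit loop on the full 25,000 reviews.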



**Creating Features from a Bag of Words with scikit-learn**

Now that we have cleaned reviews, how do we turn them into numeric values that a machine learning model can use?

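One approach is the Bag of Words. The Bag of Words model learns a vocabulary from all of the documents, then counts how many times each word appears in each document. For example, consider the following two sentences: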

* Sentence 1: "The cat sat on the hat"

* Sentence 2: "The dog ate the cat and the hat"

From these two sentences we get the following vocabulary:

{ the, cat, sat, on, hat, dog, ate, and }

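To get the bag of words, we count how many times each word appears in each sentence. In the first sentence, "the" appears twice and "cat", "sat", "on", and "hat" each appear once, so the feature vector for the first sentence is: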

* { the, cat, sat, on, hat, dog, ate, and }

* Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }

Similarly, the feature vector for the second sentence is:

* Sentence 2: { 3, 1, 0, 0, 1, 1, 1, 1 }
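The two feature vectors can be reproduced by hand with a few lines of plain Python. The helper name `bag_of_words` is made up for this sketch; it simply counts the words of a sentence in the fixed vocabulary order used above.

```python
from collections import Counter

# Vocabulary in the same order as in the text above.
vocabulary = ["the", "cat", "sat", "on", "hat", "dog", "ate", "and"]

def bag_of_words(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]

s1 = bag_of_words("The cat sat on the hat", vocabulary)
s2 = bag_of_words("The dog ate the cat and the hat", vocabulary)
print(s1)
print(s2)
```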

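The IMDB data contains many reviews, which yields a very large vocabulary. To limit the size of the feature vectors, we need to choose a vocabulary size. Here we keep the 5,000 most frequent words (remember that stop words have already been removed).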

We use the feature_extraction module from scikit-learn to create the bag-of-words features.


In [28]:
print("Creating the bag of words...\n")
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, scikit-learn's bag-of-words tool.
vectorizer = CountVectorizer(analyzer = "word",   # tokenize at the word level
                             tokenizer = None,    # no custom tokenizer; use the default word tokenization
                             preprocessor = None, # no custom preprocessor; the text was already cleaned above
                             stop_words = None,   # no stop word filtering; stop words were already removed
                             max_features = 5000) # keep only the 5000 most frequent words

# fit_transform() does two things:
# first, it fits the model and learns the vocabulary;
# second, it transforms our training data into feature vectors.
# The input to fit_transform should be a list of strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)

# NumPy arrays are easy to work with, so convert the result to an array
train_data_features = train_data_features.toarray()

Creating the bag of words...



In [29]:
print(train_data_features.shape)

(25000, 5000)


We have 25,000 rows with 5,000 features each.

CountVectorizer can in fact do the preprocessing itself, for example removing stop words and tokenizing.

Now that the Bag of Words model has been fitted, take a look at the vocabulary:

In [30]:
vocab = vectorizer.get_feature_names_out()
vocab[:20]

array(['abandoned', 'abc', 'abilities', 'ability', 'able', 'abraham',
       'absence', 'absent', 'absolute', 'absolutely', 'absurd', 'abuse',
       'abusive', 'abysmal', 'academy', 'accent', 'accents', 'accept',
       'acceptable', 'accepted'], dtype=object)

In [32]:
import numpy as np
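One possible use of NumPy at this point is to total the counts of each vocabulary word across all reviews by summing the feature matrix column by column. The tiny `vocab` and `features` arrays below are made-up stand-ins for vocab and train_data_features.

```python
import numpy as np

# Made-up stand-ins for vocab and train_data_features.
vocab = np.array(["bad", "film", "great"])
features = np.array([[0, 1, 2],
                     [1, 1, 0],
                     [0, 2, 1]])

# Sum over the rows (reviews) to get one total per word.
totals = features.sum(axis=0)

for word, count in zip(vocab, totals):
    print(count, word)
```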

**Random Forest**

With features extracted from the bag of words, we next try a random forest model to see how well it works. Here the number of trees is set to 100:

In [33]:
print("Training the random forest...")
from sklearn.ensemble import RandomForestClassifier

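# Initialize a random forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100)

# Fit the forest to the training set, using the bag-of-words features
# as inputs and the sentiment labels as the response variable
forest = forest.fit( train_data_features, train["sentiment"] )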

Training the random forest...
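The notebook fits the forest on all of the training data, so it gives no local estimate of how well the model works. One common way to get such an estimate, sketched here on synthetic data rather than the real features, is to hold out part of the labeled data. The arrays `X` and `y`, the 80/20 split, and the fixed `random_state` values are all assumptions of this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (train_data_features, train["sentiment"]).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 20))
y = (X[:, 0] > X[:, 1]).astype(int)

# Hold out 20% of the labeled rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(val_accuracy)
```

On the real data the same pattern would apply to train_data_features and train["sentiment"], at the cost of training on fewer reviews.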


**Creating a Submission**

Use the trained random forest to predict sentiment for the test set and create a submission file.

In [34]:
# Read the test data
test = pd.read_csv("/kaggle/working/testData.tsv", header=0, delimiter="\t",
                   quoting=3 )

# Verify that there are 25,000 rows and 2 columns
print(test.shape)

# Create an empty list and append the cleaned reviews one by one
num_reviews = len(test["review"])
clean_test_reviews = []

print("Cleaning and parsing the test set movie reviews...\n")
for i in range(0,num_reviews):
    if( (i+1) % 1000 == 0 ):
        print("Review %d of %d\n" % (i+1, num_reviews))
    clean_review = review_to_words( test["review"][i] )
    clean_test_reviews.append( clean_review )

# Get a bag of words for the test set and convert it to a NumPy array
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()

# Use the random forest to predict sentiment labels
result = forest.predict(test_data_features)

# Copy the results to a pandas DataFrame with an "id" column and a "sentiment" column
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )

# Use pandas to write the results to a comma-separated output file
output.to_csv( "/kaggle/working/submission.csv", index=False, quoting=3 )

(25000, 2)
Cleaning and parsing the test set movie reviews...





Review 1000 of 25000

Review 2000 of 25000

Review 3000 of 25000

Review 4000 of 25000

Review 5000 of 25000

Review 6000 of 25000

Review 7000 of 25000

Review 8000 of 25000

Review 9000 of 25000

Review 10000 of 25000

Review 11000 of 25000

Review 12000 of 25000

Review 13000 of 25000

Review 14000 of 25000

Review 15000 of 25000

Review 16000 of 25000

Review 17000 of 25000

Review 18000 of 25000

Review 19000 of 25000

Review 20000 of 25000

Review 21000 of 25000

Review 22000 of 25000

Review 23000 of 25000

Review 24000 of 25000

Review 25000 of 25000
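To see what the `quoting=3` argument of `to_csv` does to the submission file, the sketch below writes a made-up two-row frame to an in-memory buffer instead of a file. The ids and predictions are invented; the ids carry literal double quotes, as the real ones do after being read with `quoting=3`.

```python
import csv
import io

import pandas as pd

# Made-up ids and predictions; the quotes are part of the id strings.
output = pd.DataFrame({"id": ['"100_1"', '"100_2"'], "sentiment": [1, 0]})

buffer = io.StringIO()
# csv.QUOTE_NONE is the named form of quoting=3: values are written as-is.
output.to_csv(buffer, index=False, quoting=csv.QUOTE_NONE)

lines = buffer.getvalue().splitlines()
print(lines)
```

Because the quotes already belong to the id values, writing without any additional quoting keeps each id exactly as it appeared in the test file.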

