# Goal

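This section shows how to load and clean the IMDB movie review data, then apply a simple Bag of Words model to predict whether a review is positive (thumbs up) or negative (thumbs down).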

# Reading the data

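The first file we need is labeledTrainData, which contains 25,000 IMDB movie reviews, each with a positive or negative sentiment label.
The first 20,000 reviews are used as training data and the remaining 5,000 as test data.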

In [1]:
# load data
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20;
# sklearn.model_selection provides the same helpers
from sklearn.model_selection import KFold, cross_val_score, train_test_split


In [79]:
data = pd.read_csv('data/labeledTrainData.tsv', header=0,
                    delimiter='\t', quoting=3)

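##### `header=0` means the first line of the file holds the column names, `delimiter='\t'` means the fields are tab-separated, and `quoting=3` tells pandas to treat double quotes as ordinary characters; otherwise reading the file may raise an error.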
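
To see what `quoting=3` changes, the sketch below parses a tiny made-up TSV snippet rather than the real file. The value `3` is `csv.QUOTE_NONE`; the sample row and its contents are assumptions for illustration only.

```python
import csv
import io

import pandas as pd

# A made-up TSV with stray double quotes inside the review field.
sample = 'id\tsentiment\treview\n"1_8"\t1\t"He said "great" twice"\n'

# csv.QUOTE_NONE == 3: quotes are treated as ordinary characters.
df = pd.read_csv(io.StringIO(sample), header=0, delimiter='\t',
                 quoting=csv.QUOTE_NONE)

print(df.shape)           # (rows, columns)
print(df['review'][0])    # the surrounding quotes are kept verbatim
```

This is also why the `id` and `review` values shown later in the notebook still carry their surrounding double quotes.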

##### Make sure we got 25,000 rows and 3 columns.
##### The data is then split into x_train, y_train, x_test, and y_test:

In [80]:
x_train = data['review'][:20000]

In [81]:
x_test = data['review'][20000:]

In [82]:
y_train = data['sentiment'][:20000]
y_test = data['sentiment'][20000:]

In [43]:
data.columns.values

array(['id', 'sentiment', 'review'], dtype=object)

In [44]:
data.head()

|   | id | sentiment | review |
|---|----|-----------|--------|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
| 3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
| 4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |


In [46]:
x_train[9]

'"<br /><br />This movie is full of references. Like \\"Mad Max II\\", \\"The wild one\\" and many others. The ladybug´s face it´s a clear reference (or tribute) to Peter Lorre. This movie is a masterpiece. We´ll talk much more about in the future."'

# Data cleaning and text processing

Use BeautifulSoup to strip the HTML tags:

In [10]:
from bs4 import BeautifulSoup

In [11]:
# Initialize a BeautifulSoup object on a single review
example1 = BeautifulSoup(x_train[0], 'lxml')

In [47]:
# Compare the raw text with the processed text; get_text() returns the processed result
print(x_train[0])
print()
print(example1.get_text())

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

The result no longer contains the tags. For punctuation, numbers, and stop words, we can use NLTK and regular expressions.

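When handling punctuation, the usual approach is to remove it outright, but that depends on the problem. Here we are classifying the sentiment of reviews, so symbols such as "!!!" or ":-(" do carry emotion and arguably should be kept. For simplicity we remove them here, but you can experiment with other approaches.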

Likewise, we remove numbers; a better approach would be to replace every number with a placeholder token such as NUM.
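
The NUM idea is not implemented in this notebook. A minimal sketch of it, assuming a hypothetical helper named `replace_numbers` and a made-up example sentence, might look like this:

```python
import re


def replace_numbers(text):
    """Replace each run of digits (optionally with a decimal part) with NUM."""
    return re.sub(r"\d+(?:\.\d+)?", "NUM", text)


example = "I gave it 8 out of 10 back in 1999."
print(replace_numbers(example))
```

Because the placeholder consists of letters only, it would survive a later letters-only substitution if this step is applied first.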

Next, use a regular expression to handle punctuation and numbers:

In [48]:
import re

In [49]:
letters_only = re.sub('[^a-zA-Z]', # The pattern to search for
                      ' ',         # The string to replace it with
                      example1.get_text()) # The text to search

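`[]` denotes a character class and `^` means "not". In other words, this `re.sub()` call finds every character that is neither a lowercase letter a-z nor an uppercase letter A-Z and replaces it with a space, so punctuation and digits in the text become spaces.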

In [50]:
letters_only

' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    m

In [51]:
lower_case = letters_only.lower() # Convert to lower case
words = lower_case.split() # Split into words

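Finally, we need to deal with words that occur frequently but carry little meaning, known as stop words. In English, words such as a, and, is, and the are stop words. We can import a stop word list from NLTK: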

In [52]:
from nltk.corpus import stopwords

In [54]:
stopwords.words('english')[:20]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his']

Remove the stop words from the review:

In [55]:
words = [w for w in words if not w in stopwords.words('english')]

In [56]:
words[:20]

['stuff',
 'going',
 'moment',
 'mj',
 'started',
 'listening',
 'music',
 'watching',
 'odd',
 'documentary',
 'watched',
 'wiz',
 'watched',
 'moonwalker',
 'maybe',
 'want',
 'get',
 'certain',
 'insight',
 'guy']

Now combine all of the steps above into a single function:

In [57]:
def review_to_words(raw_review):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and 
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, "lxml").get_text() 
    #
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))                  
    # 
    # 5. Remove stop words
    meaningful_words = [w for w in words if not w in stops]   
    #
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return( " ".join( meaningful_words )) 

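Two things are new here. First, `stops` is converted to a set for speed, because membership tests on a set are faster than on a list. Second, the words are joined back into a single string at the end, so that the output can be fed to the Bag of Words step later.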
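
The speed difference can be checked with a rough timing sketch. The stop-word list and tokens below are made-up stand-ins (not the NLTK list), and the measured times will vary from machine to machine:

```python
import timeit

# Hypothetical stand-ins: a stop-word list and some tokens to filter.
stop_list = ["w%d" % i for i in range(200)]
stop_set = set(stop_list)
tokens = ["movie", "w150", "plot", "w199"] * 250


def filter_with(stops):
    """Keep only the tokens that are not in the given stop-word container."""
    return [w for w in tokens if w not in stops]


# Both containers give the same result; only the lookup cost differs.
assert filter_with(stop_list) == filter_with(stop_set)

list_time = timeit.timeit(lambda: filter_with(stop_list), number=20)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=20)
print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```

A list lookup scans the elements one by one, while a set lookup uses hashing, so the gap widens as the stop-word list grows.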

In [58]:
X_train_clean_review = review_to_words(x_train[3])
X_train_clean_review

'must assumed praised film greatest filmed opera ever read somewhere either care opera care wagner care anything except desire appear cultured either representation wagner swan song movie strikes unmitigated disaster leaden reading score matched tricksy lugubrious realisation text questionable people ideas opera matter play especially one shakespeare allowed anywhere near theatre film studio syberberg fashionably without smallest justification wagner text decided parsifal bisexual integration title character latter stages transmutes kind beatnik babe though one continues sing high tenor actors film singers get double dose armin jordan conductor seen face heard voice amfortas also appears monstrously double exposure kind batonzilla conductor ate monsalvat playing good friday music way transcendant loveliness nature represented scattering shopworn flaccid crocuses stuck ill laid turf expedient baffles theatre sometimes piece imperfections thoughts think syberberg splice parsifal gurneman

Next, we use a loop to clean every review in the training set:

In [59]:
# number of reviews
num_reviews = x_train.size
print(num_reviews)
# initialize an empty list to hold the clean reviews
x_train_clean_train_reviews = []

20000


In [60]:

# loop over each review
for i in range(0, num_reviews):
    #call function for each one, and add the result to the new list
    x_train_clean_train_reviews.append(review_to_words(x_train[i]))
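
The loop above looks up `x_train[i]` by index label, which works here because the labels of `x_train` start at 0. A slice such as `data['review'][20000:]` keeps its original labels, so looking up label 0 on it would raise a `KeyError`. The sketch below shows one way around this by iterating over the values directly; `clean` is a hypothetical stand-in for `review_to_words`, and the sample reviews are made up.

```python
import pandas as pd


def clean(text):
    # Hypothetical stand-in for review_to_words, kept trivial on purpose.
    return text.lower()


reviews = pd.Series(["Good Film", "Bad Film", "Fine", "Dull"])
tail = reviews[2:]            # index labels are 2 and 3, not 0 and 1

# Iterating over the values does not depend on the index labels.
cleaned_tail = [clean(r) for r in tail]
print(cleaned_tail)

# Positional access is still available through .iloc.
print(tail.iloc[0])
```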

# Creating features from a bag of words with scikit-learn

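One approach is the Bag of Words model. It learns a vocabulary from all of the documents, then counts how many times each word appears in each document. For example, take the following two sentences: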

- Sentence 1: "The cat sat on the hat"
- Sentence 2: "The dog ate the cat and the hat"

From these two sentences we get the vocabulary:

    { the, cat, sat, on, hat, dog, ate, and }

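To get the bag of words, we count how many times each word occurs in each sentence. In Sentence 1, "the" appears twice and the words cat, sat, on, and hat appear once each, so its feature vector is: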

- { the, cat, sat, on, hat, dog, ate, and }
- Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }

Similarly, the feature vector of the second sentence is:

- Sentence 2: { 3, 1, 0, 0, 1, 1, 1, 1 }
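
The two vectors can be reproduced with a few lines of plain Python. This is only an illustrative sketch that builds the vocabulary in order of first appearance; it is not how `CountVectorizer` works internally (for one thing, `CountVectorizer` orders its vocabulary alphabetically).

```python
from collections import Counter

sentences = ["The cat sat on the hat",
             "The dog ate the cat and the hat"]

# Tokenize by lowercasing and splitting on whitespace.
tokenized = [s.lower().split() for s in sentences]

# Build the vocabulary in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Count each vocabulary word in each sentence.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```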

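The IMDB data contains many reviews, which yields a very large vocabulary. To limit the size of the feature vectors, we have to choose a vocabulary size. Here we keep the 5,000 most frequent words (remember that stop words have already been removed).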

We use the feature_extraction module of scikit-learn to create the bag-of-words features.

In [61]:
print("Creating the bag of words...\n")
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  
vectorizer = CountVectorizer(analyzer = "word",   
                             tokenizer = None,    
                             preprocessor = None, 
                             stop_words = None,   
                             max_features = 5000) 

# fit_transform() does two functions: 
# First, it fits the model and learns the vocabulary; 
# second, it transforms our training data into feature vectors. 
# The input to fit_transform should be a list of strings.
train_data_features = vectorizer.fit_transform(x_train_clean_train_reviews)

# Numpy arrays are easy to work with, 
# so convert the result to an array
train_data_features = train_data_features.toarray()

Creating the bag of words...



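#### Here we have 20,000 rows (one per training review), each with 5,000 features.
#### Note that CountVectorizer can also do the preprocessing itself, such as removing stop words and tokenizing.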
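
As a rough sketch of that alternative, the snippet below lets `CountVectorizer` lowercase the text and drop stop words by itself. The two toy sentences are assumptions for illustration, and `stop_words="english"` uses scikit-learn's built-in list, which is not identical to the NLTK list used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the hat",
        "The dog ate the cat and the hat"]

# stop_words="english" uses scikit-learn's built-in list, which is not
# the same list as NLTK's; lowercasing is on by default.
vectorizer = CountVectorizer(stop_words="english", max_features=5000)
features = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))
print(features.toarray())
```

HTML removal is not covered by these options, so the BeautifulSoup step would still be needed for the IMDB reviews.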
#### The bag-of-words model is now fitted; take a look at the vocabulary:

In [63]:
# Take a look at the words in the vocabulary
# On scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead
vocab = vectorizer.get_feature_names()
vocab[:20]

['abandoned',
 'abc',
 'abilities',
 'ability',
 'able',
 'abraham',
 'abrupt',
 'absence',
 'absent',
 'absolute',
 'absolutely',
 'absurd',
 'abuse',
 'abusive',
 'abysmal',
 'academy',
 'accent',
 'accents',
 'accept',
 'acceptable']

In [64]:
import numpy as np

In [65]:
# sum up the counts of each vocabulary word
dist = np.sum(train_data_features, axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)

151 abandoned
114 abc
81 abilities
376 ability
1035 able
69 abraham
63 abrupt
96 absence
66 absent
289 absolute
1181 absolutely
236 absurd
145 abuse
75 abusive
75 abysmal
228 academy
399 accent
161 accents
248 accept
109 acceptable
118 accepted
81 access
255 accident
166 accidentally
70 accompanied
96 accomplished
241 according
152 account
227 accurate
97 accused
141 achieve
114 achieved
104 achievement
66 acid
781 across
986 act
515 acted
5220 acting
2719 action
250 actions
71 activities
1923 actor
3601 actors
969 actress
288 actresses
319 acts
618 actual
3410 actually
115 ad
227 adam
80 adams
364 adaptation
68 adaptations
129 adapted
652 add
353 added
136 adding
280 addition
268 adds
91 adequate
65 admirable
96 admire
476 admit
102 admittedly
67 adolescent
63 adopted
84 adorable
406 adult
290 adults
80 advance
73 advanced
116 advantage
410 adventure
153 adventures
78 advertising
202 advice
69 advise
273 affair
75 affect
95 affected
84 afford
98 aforementioned
272 afraid
174 africa
...

503 christmas
336 christopher
79 christy
90 chuck
305 church
95 cia
174 cinderella
1218 cinema
329 cinematic
89 cinematographer
799 cinematography
78 circle
173 circumstances
72 cities
94 citizen
64 citizens
961 city
117 civil
77 civilization
175 claim
83 claimed
158 claims
137 claire
174 clark
710 class
80 classes
1442 classic
78 classical
184 classics
190 clean
624 clear
721 clearly
416 clever
73 cleverly
665 clich
77 cliche
95 cliff
73 climactic
350 climax
90 clint
128 clips
75 clock
1036 close
77 closed
116 closely
168 closer
75 closest
82 closet
145 closing
262 clothes
81 clothing
86 clown
345 club
182 clue
96 clues
81 clumsy
507 co
82 coach
65 coaster
190 code
83 coffee
84 coherent
451 cold
113 cole
270 collection
409 college
74 colonel
293 color
109 colorful
159 colors
108 colour
147 columbo
172 com
84 combat
192 combination
73 combine
166 combined
2564 come
131 comedian
253 comedic
365 comedies
2547 comedy
1965 comes
64 comfort
90 comfortable
729 comic
136 comical
98 comics
...

232 ford
193 foreign
147 forest
317 forever
566 forget
166 forgettable
99 forgive
152 forgot
274 forgotten
602 form
153 format
404 former
83 forms
189 formula
79 formulaic
144 forth
130 fortunately
111 fortune
83 forty
529 forward
153 foster
90 foul
2062 found
738 four
138 fourth
258 fox
179 frame
65 framed
191 france
110 franchise
77 francis
105 francisco
88 franco
357 frank
69 frankenstein
210 frankly
88 freak
211 fred
245 freddy
563 free
179 freedom
159 freeman
654 french
69 frequent
134 frequently
289 fresh
156 friday
1138 friend
152 friendly
1460 friends
230 friendship
160 frightening
490 front
102 frustrated
66 frustrating
84 frustration
223 fu
92 fulci
1415 full
71 fuller
339 fully
2151 fun
65 function
86 funeral
140 funnier
276 funniest
3404 funny
75 furious
84 furthermore
72 fury
756 future
99 futuristic
94 fx
75 gabriel
86 gadget
111 gag
217 gags
113 gain
1033 game
263 games
64 gandhi
352 gang
214 gangster
70 gangsters
382 garbage
87 garbo
94 garden
218 gary
163 gas
64 gather

2704 may
1906 maybe
82 mayor
1363 mean
385 meaning
122 meaningful
91 meaningless
628 means
491 meant
195 meanwhile
82 measure
96 meat
74 mechanical
255 media
124 medical
279 mediocre
100 medium
537 meet
184 meeting
547 meets
95 mel
146 melodrama
97 melodramatic
78 melting
271 member
438 members
528 memorable
226 memories
250 memory
1552 men
87 menace
100 menacing
259 mental
121 mentally
643 mention
451 mentioned
66 mentioning
69 mentions
147 mere
291 merely
92 merit
68 merits
73 meryl
524 mess
674 message
109 messages
68 messed
227 met
147 metal
64 metaphor
84 method
71 methods
162 mexican
153 mexico
159 mgm
1096 michael
141 michelle
83 mickey
259 mid
766 middle
157 midnight
2359 might
75 mighty
109 miike
226 mike
105 mild
134 mildly
87 mildred
101 mile
208 miles
354 military
64 milk
89 mill
132 miller
304 million
125 millions
69 min
1633 mind
120 minded
123 mindless
158 minds
220 mine
185 mini
90 minimal
65 minimum
322 minor
619 minute
2371 minutes
78 miracle
144 mirror
117 miscast
...

70 ruby
64 rude
174 ruin
181 ruined
65 ruins
69 rukh
138 rule
194 rules
999 run
792 running
414 runs
84 rural
117 rush
107 rushed
170 russell
221 russian
108 ruth
83 ruthless
177 ryan
65 sabrina
96 sacrifice
791 sad
93 sadistic
453 sadly
87 sadness
200 safe
84 safety
80 saga
1730 said
196 sake
115 sally
376 sam
110 samurai
153 san
122 sandler
75 sandra
215 santa
73 sappy
167 sarah
237 sat
111 satan
208 satire
84 satisfied
76 satisfy
176 satisfying
172 saturday
108 savage
819 save
216 saved
110 saves
203 saving
2519 saw
4298 say
749 saying
861 says
170 scale
178 scare
71 scarecrow
242 scared
154 scares
797 scary
155 scenario
4296 scene
318 scenery
4220 scenes
86 scheme
1359 school
529 sci
433 science
91 scientific
277 scientist
109 scientists
117 scooby
82 scope
829 score
69 scores
66 scorsese
65 scotland
462 scott
78 scottish
216 scream
200 screaming
96 screams
2025 screen
144 screening
564 screenplay
133 screenwriter
2408 script
102 scripted
120 scripts
71 scrooge
216 sea
109 seagal
...

125 virgin
79 virginia
174 virtually
114 virus
69 visible
246 vision
64 visions
206 visit
419 visual
201 visually
205 visuals
93 vivid
928 voice
73 voiced
168 voices
87 voight
128 von
182 vote
221 vs
64 vulgar
69 vulnerable
70 wacky
67 wagner
567 wait
77 waited
444 waiting
66 waitress
113 wake
406 walk
148 walked
116 walken
86 walker
350 walking
172 walks
293 wall
82 wallace
101 walls
67 walsh
187 walter
74 wandering
71 wang
132 wanna
93 wannabe
2987 want
1065 wanted
253 wanting
1040 wants
1681 war
102 ward
184 warm
87 warming
66 warmth
128 warn
137 warned
142 warner
95 warren
99 warrior
67 warriors
243 wars
230 washington
1203 waste
450 wasted
116 wasting
5594 watch
259 watchable
1790 watched
96 watches
3675 watching
433 water
82 waters
65 watson
142 wave
77 waves
6467 way
195 wayne
636 ways
623 weak
75 weakest
85 wealth
118 wealthy
118 weapon
135 weapons
145 wear
271 wearing
135 wears
90 web
86 website
250 wedding
378 week
167 weekend
153 weeks
105 weight
542 weird
173 welcome
...

# Random Forest

We now have features from the bag of words; next we use a random forest as the model and see how well it performs. The number of trees is set to 100:

In [69]:
print("Training the random forest...")
from sklearn.ensemble import RandomForestClassifier

# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100) 

# Fit the forest to the training set, using the bag of words as 
# features and the sentiment labels as the response variable
#
# This may take a few minutes to run
forest = forest.fit( train_data_features, y_train)

Training the random forest...


# Creating a Submission

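The following applies the trained random forest to the test set and creates a submission file.

Note that when we apply the bag of words to the test set we only call transform, not fit_transform, because the latter also fits the vocabulary. The test set must not be used for fitting; otherwise we risk overfitting.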
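
The difference between the two calls can be seen on a toy example. The documents below are made up for illustration; the point is only that `transform` reuses the vocabulary learned by `fit_transform` and ignores words it has never seen.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["great movie great cast", "boring movie"]
test_docs = ["great plot", "boring cast"]

vectorizer = CountVectorizer()
train_features = vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_features = vectorizer.transform(test_docs)        # reuses it unchanged

# Both matrices have one column per training-vocabulary word.
print(train_features.shape, test_features.shape)

# "plot" never appeared in the training documents, so it has no column.
print("plot" in vectorizer.vocabulary_)
```

Keeping the columns identical is what allows a model trained on `train_features` to make predictions on `test_features`.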

In [88]:
# Note: these are the first 5,000 rows, which are also part of x_train
x_test = data['review'][:5000]

In [90]:
x_test[1]

'"\\"The Classic War of the Worlds\\" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells\' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur \\"critics\\" look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most people never agree with the \\"critics\\". We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells\' classic novel, and we found it to be very entertaining. This made it easy to overlook what the \\"critics\\" perceive to be its shortcomings."'

In [92]:

# Check the shape: x_test is a Series holding 5,000 reviews
print(x_test.shape)

# Create an empty list and append the clean reviews one by one
num_reviews = len(x_test)
x_test_clean_test_reviews = [] 

print("Cleaning and parsing the test set movie reviews...\n")
for i in range(0,num_reviews):
    
    clean_review = review_to_words(x_test[i])
    x_test_clean_test_reviews.append(clean_review )

# Get a bag of words for the test set, and convert to a numpy array
test_data_features = vectorizer.transform(x_test_clean_test_reviews)
test_data_features = test_data_features.toarray()

# Use the random forest to make sentiment label predictions
result = forest.predict(test_data_features)

# Copy the results to a pandas DataFrame with a single
# "sentiment" column (no "id" column is written here)
output = pd.DataFrame( data={"sentiment":result} )

# Use pandas to write the comma-separated output file
output.to_csv( "result/Bag_of_Words_model.csv", index=False, quoting=3 )

(5000,)
Cleaning and parsing the test set movie reviews...



In [102]:
data['sentiment'][:5000].values

array([1, 1, 0, ..., 0, 1, 1])

In [103]:
abs(result-data['sentiment'][:5000].values).sum()

0

In [104]:
result

array([1, 1, 0, ..., 0, 1, 1])

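### The predictions match the labels on all 5,000 reviews (100% accuracy). These reviews are the first 5,000 rows of the training data, so this is training accuracy rather than a cross-validation score.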
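
A held-out estimate needs reviews the model did not see during training. The sketch below shows the idea on a tiny made-up corpus; the texts, labels, and `random_state` are assumptions for illustration. With the notebook's variables, the same steps would apply to the reviews in `data['review'][20000:]` and the labels in `y_test`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score

# Made-up toy data; 1 = positive, 0 = negative.
train_texts = ["great fun film", "wonderful acting", "great cast",
               "boring plot", "awful acting", "boring mess"]
train_labels = [1, 1, 1, 0, 0, 0]
held_out_texts = ["great acting", "awful boring film"]
held_out_labels = [1, 0]

# Fit the vocabulary on the training texts only.
vectorizer = CountVectorizer()
train_features = vectorizer.fit_transform(train_texts)
held_out_features = vectorizer.transform(held_out_texts)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(train_features, train_labels)

# Score on reviews the model has never seen during training.
predictions = forest.predict(held_out_features)
print(accuracy_score(held_out_labels, predictions))
```

An accuracy measured this way is usually lower than the training accuracy and gives a more realistic picture of how the model handles new reviews.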