In [2]:
# Dataset location: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

In [3]:
import re
def rm_tags(text):
    # Strip HTML tags such as <br /> from the review text.
    return re.sub(r"<[^>]+>", "", text)
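A quick check of what the regular expression does, using a made-up snippet (the sample string is an assumption, not taken from the dataset). Because each tag is replaced with an empty string, sentences separated only by `<br />` tags end up joined without a space. The hypothetical `rm_tags_spaced` variant is one way to keep a separator; it is not part of the original notebook.

```python
import re

def rm_tags(text):
    # Same pattern as the notebook: drop anything that looks like <...>
    return re.sub(r"<[^>]+>", "", text)

def rm_tags_spaced(text):
    # Hypothetical variant: replace each run of tags with a single space
    return re.sub(r"(?:<[^>]+>)+", " ", text)

sample = "Great film.<br /><br />Loved it."   # made-up example
print(rm_tags(sample))         # Great film.Loved it.
print(rm_tags_spaced(sample))  # Great film. Loved it.
```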

In [4]:
import os
def read_files(filetype):
    path = "data/aclImdb/"
    file_list = []
    all_labels = []

    # Positive reviews come first (label 1), then negative reviews (label 0).
    # Labels are derived from the files found instead of a hard-coded 12500.
    for subdir, label in (("/pos/", 1), ("/neg/", 0)):
        class_path = path + filetype + subdir
        for f in os.listdir(class_path):
            file_list.append(class_path + f)
            all_labels.append(label)

    print("read", filetype, " files:", len(file_list))

    all_texts = []
    for file_path in file_list:
        with open(file_path, "r", encoding="utf-8") as file_input:
            all_texts.append(rm_tags(" ".join(file_input.readlines())))

    return all_labels, all_texts

In [5]:
y_train, train_text = read_files("train")

read train  files: 25000


In [6]:
y_test, test_text = read_files("test")

read test  files: 25000


In [7]:
train_text[2]

'Brilliant over-acting by Lesley Ann Warren. Best dramatic hobo lady I have ever seen, and love scenes in clothes warehouse are second to none. The corn on face is a classic, as good as anything in Blazing Saddles. The take on lawyers is also superb. After being accused of being a turncoat, selling out his boss, and being dishonest the lawyer of Pepto Bolt shrugs indifferently "I\'m a lawyer" he says. Three funny words. Jeffrey Tambor, a favorite from the later Larry Sanders show, is fantastic here too as a mad millionaire who wants to crush the ghetto. His character is more malevolent than usual. The hospital scene, and the scene where the homeless invade a demolition site, are all-time classics. Look for the legs scene and the two big diggers fighting (one bleeds). This movie gets better each time I see it (which is quite often).'

In [8]:
y_train[2]

1

In [9]:
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [10]:
token = Tokenizer(num_words=2000)
token.fit_on_texts(train_text)

In [11]:
print(token.document_count)

25000


In [12]:
# token.word_docs     number of reviews in which each word appears
# token.word_counts   total number of occurrences of each word across all reviews
# token.word_index    integer index assigned to each word, ranked by frequency (1 = most frequent)
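To make the ranking concrete, here is a simplified pure-Python imitation of how the word index and the sequences relate. It is only a sketch: the real `Tokenizer` also filters punctuation, and the helper names `build_word_index` and `to_sequence` and the tiny corpus are assumptions made for illustration. With `num_words=N`, only words whose index is below `N` are kept, and all other words are dropped from the sequence.

```python
from collections import Counter

def build_word_index(texts):
    # Rank words by total count; the most frequent word gets index 1
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    # Keep only words whose index is below num_words; drop everything else
    seq = []
    for w in text.lower().split():
        i = word_index.get(w)
        if i is not None and i < num_words:
            seq.append(i)
    return seq

texts = ["the film the plot", "the film"]   # made-up corpus
word_index = build_word_index(texts)
print(word_index)   # {'the': 1, 'film': 2, 'plot': 3}
print(to_sequence("the plot of the film", word_index, num_words=3))   # [1, 1, 2]
```

In the example, "plot" has index 3 and "of" was never seen, so both disappear from the output sequence.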

In [13]:
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)

In [14]:
print(x_train_seq[0])

[308, 6, 3, 1068, 208, 8, 29, 1, 168, 54, 13, 45, 81, 40, 391, 109, 137, 13, 57, 149, 7, 1, 481, 68, 5, 260, 11, 6, 72, 5, 631, 70, 6, 1, 5, 1, 1530, 33, 66, 63, 204, 139, 64, 1229, 1, 4, 1, 222, 899, 28, 68, 4, 1, 9, 693, 2, 64, 1530, 50, 9, 215, 1, 386, 7, 59, 3, 1470, 798, 5, 176, 1, 391, 9, 1235, 29, 308, 3, 352, 343, 142, 129, 5, 27, 4, 125, 1470, 5, 308, 9, 532, 11, 107, 1466, 4, 57, 554, 100, 11, 308, 6, 226, 47, 3, 11, 8, 214]


In [15]:
print(train_text[0])

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!


In [16]:
# pad or truncate every review to a length of 100

x_train = sequence.pad_sequences(x_train_seq, maxlen=100)
x_test = sequence.pad_sequences(x_test_seq, maxlen=100)

In [17]:
print("before", len(x_train_seq[2]))
print("after", len(x_train[2]))

before 115
after 100
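With its default settings, `pad_sequences` both pads and truncates at the front of the sequence, so a 115-token review keeps its last 100 tokens. The hypothetical helper `pad_pre` below sketches that behaviour on plain lists; it is an illustration, not the Keras implementation.

```python
def pad_pre(seq, maxlen):
    # Keep the last maxlen tokens, then left-pad with zeros if too short
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

print(pad_pre([1, 2, 3, 4, 5], 3))  # [3, 4, 5]
print(pad_pre([1, 2], 4))           # [0, 0, 1, 2]
```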


In [18]:
# build model

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding

In [19]:
model = Sequential()

model.add(Embedding(output_dim=32, input_dim=2000, input_length=100))
model.add(Dropout(0.2))

model.add(Flatten())
model.add(Dense(units=256, activation="relu"))
model.add(Dropout(0.35))

model.add(Dense(units=1, activation="sigmoid"))

In [20]:
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 32)           64000     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 32)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 3200)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               819456    
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 257       
Total params: 883,713
Trainable params: 883,713
Non-trainable params: 0
_________________________________________________________________
None
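The parameter counts in the summary can be reproduced by hand. The sketch below only restates the layer sizes used above; the variable names are chosen for illustration.

```python
vocab_size, embed_dim, seq_len, hidden = 2000, 32, 100, 256

embedding = vocab_size * embed_dim      # one 32-dim vector per word index
flat = seq_len * embed_dim              # width of the Flatten output
dense_1 = flat * hidden + hidden        # weights + biases
dense_2 = hidden * 1 + 1                # weights + bias of the sigmoid unit

total = embedding + dense_1 + dense_2
print(embedding, flat, dense_1, dense_2, total)   # 64000 3200 819456 257 883713
```

Almost all parameters sit in the first Dense layer, whose size grows with the padded review length.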

In [21]:
# compile and train the model

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

train_history = model.fit(x_train, y_train, batch_size=100, epochs=10, verbose=2, validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 - 4s - loss: 0.4823 - acc: 0.7559 - val_loss: 0.5923 - val_acc: 0.7210
Epoch 2/10
 - 4s - loss: 0.2701 - acc: 0.8877 - val_loss: 0.4904 - val_acc: 0.7812
Epoch 3/10
 - 4s - loss: 0.1643 - acc: 0.9381 - val_loss: 0.6174 - val_acc: 0.7674
Epoch 4/10
 - 4s - loss: 0.0865 - acc: 0.9710 - val_loss: 0.7340 - val_acc: 0.7702
Epoch 5/10
 - 4s - loss: 0.0480 - acc: 0.9840 - val_loss: 1.1631 - val_acc: 0.7198
Epoch 6/10
 - 4s - loss: 0.0333 - acc: 0.9891 - val_loss: 1.2418 - val_acc: 0.7358
Epoch 7/10
 - 4s - loss: 0.0298 - acc: 0.9891 - val_loss: 1.1763 - val_acc: 0.7574
Epoch 8/10
 - 4s - loss: 0.0248 - acc: 0.9917 - val_loss: 1.1043 - val_acc: 0.7790
Epoch 9/10
 - 4s - loss: 0.0254 - acc: 0.9908 - val_loss: 1.0688 - val_acc: 0.7880
Epoch 10/10
 - 4s - loss: 0.0235 - acc: 0.9919 - val_loss: 1.0025 - val_acc: 0.8066
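One caveat when reading `val_acc`: Keras takes the `validation_split` samples from the end of the arrays, before any shuffling. Since `read_files` returns every positive review first, the last 20% of the training set contains only negative reviews. The sketch below reproduces that slicing on the label list alone and shows one possible remedy, shuffling beforehand. The seed is an arbitrary choice.

```python
import random

labels = [1] * 12500 + [0] * 12500          # order produced by read_files
split_at = len(labels) - len(labels) // 5   # the last 20% is held out
val_labels = labels[split_at:]
print(len(val_labels), sum(val_labels))     # 5000 0

# Possible remedy: shuffle samples and labels with the same permutation
order = list(range(len(labels)))
random.Random(0).shuffle(order)             # fixed seed, arbitrary
shuffled = [labels[i] for i in order]
print(sum(shuffled[split_at:]))             # positives now present in the tail
```

In the notebook itself, the same permutation would have to be applied to both `x_train` and `y_train` before calling `fit`.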


In [22]:
scores = model.evaluate(x_test, y_test, verbose=1)
scores[1]



0.81503999999999999

In [23]:
prediction = model.predict_classes(x_test)
prediction[:10]



array([[1],
       [1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1]])

In [25]:
predict_classes = prediction.reshape(-1)
predict_classes[:10]

array([1, 1, 0, 1, 1, 1, 1, 1, 1, 1])
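`predict_classes` is available in the Keras version used here but was removed from later releases. For a single sigmoid output it amounts to thresholding the predicted probability at 0.5, which the sketch below shows on made-up probabilities using plain lists.

```python
probs = [[0.91], [0.62], [0.08], [0.50]]   # made-up sigmoid outputs, shape (n, 1)

# Class 1 when the probability exceeds 0.5, otherwise class 0
classes = [[int(p > 0.5)] for (p,) in probs]
flat = [c for (c,) in classes]             # same effect as reshape(-1)
print(classes)  # [[1], [1], [0], [0]]
print(flat)     # [1, 1, 0, 0]
```

On the NumPy array returned by `model.predict`, the equivalent expression would be `(model.predict(x_test) > 0.5).astype("int32")`.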

In [30]:
SentimentDict = {1: "positive", 0: "negative"}

def display_test_Sentiment(i):
    print(test_text[i])
    print("label: ", SentimentDict[y_test[i]], "predict: ", SentimentDict[predict_classes[i]])

In [31]:
display_test_Sentiment(2)

As a recreational golfer with some knowledge of the sport's history, I was pleased with Disney's sensitivity to the issues of class in golf in the early twentieth century. The movie depicted well the psychological battles that Harry Vardon fought within himself, from his childhood trauma of being evicted to his own inability to break that glass ceiling that prevents him from being accepted as an equal in English golf society. Likewise, the young Ouimet goes through his own class struggles, being a mere caddie in the eyes of the upper crust Americans who scoff at his attempts to rise above his standing. What I loved best, however, is how this theme of class is manifested in the characters of Ouimet's parents. His father is a working-class drone who sees the value of hard work but is intimidated by the upper class; his mother, however, recognizes her son's talent and desire and encourages him to pursue his dream of competing against those who think he is inferior.Finally, the golf scenes

In [32]:
input_text = """
I don't usually do reviews but this film was such a huge disappointment I couldn't fight it anymore. The original movie was so good and, considering this is the exact same movie, there was really not much that could go wrong. In theory of course, because in reality the final result is just soulless. Everything feels fake. From Emma Watson's acting to the cgi and the props. I love Emma Watson but in this film she is just playing herself trying to play Belle. To be fair though, no one in the film actually manages to instill the characters with the same emotion and personality as the original except maybe Josh Gad. Luke Evans is very good but his character is written as a villain from the start, while in the original he evolves from slightly annoying to "evil". Which brings me to my next point: the awful writing. The film treats the audience like we're stupid and needs to explain everything verbally instead of just letting things show through the action. The characters are one dimensional and don't change as the story unfolds. Gaston is the villain. The Beast is just a misunderstood soul from the beginning despite the prologue telling us otherwise. Even his bad temper is watered down. The writers had already the script written for them, all they had to do was add a few more lines here and there and create two or three scenes that would blend in seamlessly with the original (since, I repeat, they chose to use almost word for word the 1991 script with minor changes). Well, the new dialogue feels very wooden and unnatural. The new scenes add nothing to the story and, even though the creators try to answer some questions we have from the original, in the end they create new plot holes that go unanswered. I miss the subtlety of the 1991 film in which every expression, every line and every pause added something either to the progression of the story or the characterization of the heroes without anything feeling forced. 
I keep mentioning the original a lot but that is because this movie has nothing new to offer really, so I can't fully separate it from the 1991 one. In the end, what annoys me the most is that the 2017 remake had great potential to become a new classic and stand on its own had it been handled a little differently and not with a rushed "let's make some good money" mentality. There are very few good things about this movie, one of which is the music which is simply magical and manages to convey all the emotions the actors can't. Then there is the ending (after the transformation) where there is a more realistic touch as the villagers remember their friends-relatives that work at the castle and are finally reunited. Overall, despite the enormous hype, the movie just makes the original stand out even more as a timeless film that won't be surpassed by another adaptation any time soon.
"""

In [36]:
def predict_review(input_text):
    input_seq = token.texts_to_sequences([input_text])
    pad_input_seq = sequence.pad_sequences(input_seq, maxlen=100)
    predict_result = model.predict_classes(pad_input_seq)
    print(SentimentDict[predict_result[0][0]])

In [37]:
predict_review(input_text)

positive


In [38]:
# rebuild the data with a 3800-word vocabulary and reviews padded/truncated to 380 tokens

token = Tokenizer(num_words=3800)
token.fit_on_texts(train_text)

x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)

x_train = sequence.pad_sequences(x_train_seq, maxlen=380)
x_test = sequence.pad_sequences(x_test_seq, maxlen=380)

In [39]:
model = Sequential()

model.add(Embedding(output_dim=32, input_dim=3800, input_length=380))
model.add(Dropout(0.2))

model.add(Flatten())
model.add(Dense(units=256, activation="relu"))
model.add(Dropout(0.2))

model.add(Dense(units=1, activation="sigmoid"))

In [40]:
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 380, 32)           121600    
_________________________________________________________________
dropout_3 (Dropout)          (None, 380, 32)           0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 12160)             0         
_________________________________________________________________
dense_3 (Dense)              (None, 256)               3113216   
_________________________________________________________________
dropout_4 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 257       
Total params: 3,235,073
Trainable params: 3,235,073
Non-trainable params: 0
_________________________________________________________________


In [41]:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

train_history = model.fit(x_train, y_train, batch_size=100, epochs=10, verbose=2, validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 - 15s - loss: 0.4685 - acc: 0.7647 - val_loss: 0.4073 - val_acc: 0.8280
Epoch 2/10
 - 15s - loss: 0.1955 - acc: 0.9235 - val_loss: 0.4471 - val_acc: 0.8174
Epoch 3/10
 - 15s - loss: 0.0760 - acc: 0.9748 - val_loss: 0.6651 - val_acc: 0.7860
Epoch 4/10
 - 16s - loss: 0.0286 - acc: 0.9918 - val_loss: 0.9089 - val_acc: 0.7726
Epoch 5/10
 - 15s - loss: 0.0133 - acc: 0.9966 - val_loss: 0.9777 - val_acc: 0.7892
Epoch 6/10
 - 15s - loss: 0.0099 - acc: 0.9973 - val_loss: 0.8830 - val_acc: 0.8158
Epoch 7/10
 - 15s - loss: 0.0110 - acc: 0.9967 - val_loss: 1.0538 - val_acc: 0.7936
Epoch 8/10
 - 15s - loss: 0.0098 - acc: 0.9967 - val_loss: 1.0405 - val_acc: 0.8050
Epoch 9/10
 - 15s - loss: 0.0123 - acc: 0.9958 - val_loss: 1.7423 - val_acc: 0.7228
Epoch 10/10
 - 15s - loss: 0.0146 - acc: 0.9950 - val_loss: 1.3401 - val_acc: 0.7748


In [42]:
def predict_review(input_text):
    input_seq = token.texts_to_sequences([input_text])
    pad_input_seq = sequence.pad_sequences(input_seq, maxlen=380)
    predict_result = model.predict_classes(pad_input_seq)
    print(SentimentDict[predict_result[0][0]])

In [43]:
scores = model.evaluate(x_test, y_test, verbose=1)
scores[1]



0.84360000000000002

In [44]:
# build an RNN model

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN

In [45]:
model = Sequential()

model.add(Embedding(output_dim=32, input_dim=3800, input_length=380))
model.add(Dropout(0.35))

model.add(SimpleRNN(units=16))

model.add(Dense(units=256, activation="relu"))
model.add(Dropout(0.35))

model.add(Dense(units=1, activation="sigmoid"))

In [46]:
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 380, 32)           121600    
_________________________________________________________________
dropout_5 (Dropout)          (None, 380, 32)           0         
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 16)                784       
_________________________________________________________________
dense_5 (Dense)              (None, 256)               4352      
_________________________________________________________________
dropout_6 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 257       
Total params: 126,993
Trainable params: 126,993
Non-trainable params: 0
_________________________________________________________________
None
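The 784 parameters of the SimpleRNN layer follow from its three parts: input weights, recurrent weights, and biases. A short check using the sizes from the summary above (variable names are illustrative):

```python
embed_dim, units, hidden = 32, 16, 256

# input weights + recurrent weights + biases
simple_rnn = units * embed_dim + units * units + units
dense_5 = units * hidden + hidden
total = 3800 * embed_dim + simple_rnn + dense_5 + (hidden + 1)
print(simple_rnn, dense_5, total)   # 784 4352 126993
```

Because the recurrent layer emits only its final 16-dimensional state, the following Dense layer is far smaller than in the Flatten-based models.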

In [47]:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

train_history = model.fit(x_train, y_train, batch_size=100, epochs=10, verbose=2, validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 - 12s - loss: 0.5302 - acc: 0.7296 - val_loss: 0.4790 - val_acc: 0.8022
Epoch 2/10
 - 12s - loss: 0.3603 - acc: 0.8510 - val_loss: 0.5809 - val_acc: 0.7570
Epoch 3/10
 - 12s - loss: 0.2895 - acc: 0.8831 - val_loss: 0.4606 - val_acc: 0.8198
Epoch 4/10
 - 12s - loss: 0.2531 - acc: 0.9017 - val_loss: 0.7098 - val_acc: 0.7358
Epoch 5/10
 - 12s - loss: 0.2055 - acc: 0.9220 - val_loss: 0.7631 - val_acc: 0.7348
Epoch 6/10
 - 12s - loss: 0.1576 - acc: 0.9413 - val_loss: 0.7441 - val_acc: 0.7678
Epoch 7/10
 - 12s - loss: 0.1259 - acc: 0.9553 - val_loss: 0.8328 - val_acc: 0.7696
Epoch 8/10
 - 12s - loss: 0.1079 - acc: 0.9605 - val_loss: 0.6824 - val_acc: 0.8094
Epoch 9/10
 - 12s - loss: 0.0898 - acc: 0.9677 - val_loss: 0.9759 - val_acc: 0.7446
Epoch 10/10
 - 12s - loss: 0.0780 - acc: 0.9717 - val_loss: 0.8336 - val_acc: 0.7928


In [48]:
scores = model.evaluate(x_test, y_test, verbose=1)
scores[1]



0.83348

In [49]:
# build an LSTM model

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM

In [50]:
model = Sequential()

model.add(Embedding(output_dim=32, input_dim=3800, input_length=380))
model.add(Dropout(0.2))

model.add(LSTM(32))

model.add(Dense(units=256, activation="relu"))
model.add(Dropout(0.2))

model.add(Dense(units=1, activation="sigmoid"))

In [51]:
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 380, 32)           121600    
_________________________________________________________________
dropout_7 (Dropout)          (None, 380, 32)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_7 (Dense)              (None, 256)               8448      
_________________________________________________________________
dropout_8 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 257       
Total params: 138,625
Trainable params: 138,625
Non-trainable params: 0
_________________________________________________________________
None
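The LSTM count works the same way as for a simple recurrent layer, except that there are four weight blocks (input gate, forget gate, cell candidate, output gate), each with its own input weights, recurrent weights, and biases. A short check with the sizes from the summary (variable names are illustrative):

```python
embed_dim, units, hidden = 32, 32, 256

# One block: input weights + recurrent weights + biases
per_block = units * embed_dim + units * units + units
lstm = 4 * per_block
dense_7 = units * hidden + hidden
total = 3800 * embed_dim + lstm + dense_7 + (hidden + 1)
print(lstm, dense_7, total)   # 8320 8448 138625
```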

In [52]:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

train_history = model.fit(x_train, y_train, batch_size=100, epochs=10, verbose=2, validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 - 36s - loss: 0.4923 - acc: 0.7522 - val_loss: 0.4399 - val_acc: 0.7970
Epoch 2/10
 - 35s - loss: 0.2763 - acc: 0.8868 - val_loss: 0.5999 - val_acc: 0.7250
Epoch 3/10
 - 36s - loss: 0.2260 - acc: 0.9131 - val_loss: 0.4606 - val_acc: 0.8182
Epoch 4/10
 - 35s - loss: 0.1964 - acc: 0.9257 - val_loss: 0.3680 - val_acc: 0.8396
Epoch 5/10
 - 35s - loss: 0.1819 - acc: 0.9306 - val_loss: 0.2851 - val_acc: 0.8872
Epoch 6/10
 - 35s - loss: 0.1630 - acc: 0.9391 - val_loss: 0.3927 - val_acc: 0.8462
Epoch 7/10
 - 35s - loss: 0.1434 - acc: 0.9485 - val_loss: 0.4803 - val_acc: 0.8232
Epoch 8/10
 - 35s - loss: 0.1270 - acc: 0.9521 - val_loss: 0.4938 - val_acc: 0.8318
Epoch 9/10
 - 35s - loss: 0.1111 - acc: 0.9607 - val_loss: 0.5395 - val_acc: 0.8244
Epoch 10/10
 - 36s - loss: 0.1136 - acc: 0.9587 - val_loss: 0.5950 - val_acc: 0.8302
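In this log the training loss keeps falling while `val_loss` reaches its minimum well before the last epoch, a typical sign of overfitting. The sketch below simply picks the epoch with the lowest logged `val_loss`. Keras offers an `EarlyStopping` callback that monitors `val_loss` during training, but it was not used in this notebook.

```python
# val_loss per epoch, copied from the LSTM training log above
val_loss = [0.4399, 0.5999, 0.4606, 0.3680, 0.2851,
            0.3927, 0.4803, 0.4938, 0.5395, 0.5950]

# Index of the smallest value, converted to a 1-based epoch number
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch, val_loss[best_epoch - 1])   # 5 0.2851
```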


In [53]:
scores = model.evaluate(x_test, y_test, verbose=1)
scores[1]



0.85568
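For reference, the four test accuracies reported by `model.evaluate` in this notebook can be collected side by side. The values are copied from the outputs above; the labels are shorthand introduced here.

```python
results = {
    "MLP, 2000 words, length 100": 0.81504,
    "MLP, 3800 words, length 380": 0.84360,
    "SimpleRNN, 3800 words, length 380": 0.83348,
    "LSTM, 3800 words, length 380": 0.85568,
}

# Print from best to worst test accuracy
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<36}{acc:.4f}")

best = max(results, key=results.get)
```

These are single runs, so differences of a percentage point or two should be read with caution.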

In [54]:
input_text = """Beauty and the Beast is about a deposed slave-owning aristocrat who imprisons a farm girl. She undergoes Stockholm Syndrome, identifying with her captor, then proceeds to betray her village's uprising and reinstates the slave-owning prince to power by offering her hand in marriage.

Furthermore, Belle's contempt for the provincial farming community and their lack of refinement stems from vague memories she has of a more cultured upbringing in Paris. When she later is shown a vision of her childhood house and remarks "it's so small," this was a moment where she could put it all together. 

The lack of refinement in the rural areas was due to brutal exploitation which forced unmarried women to beg in the streets. It is likely that the community's surplus resources were taken by aristocrats like the Beast, and used to fund his opulent palace. Thus, depriving the farming community of leisure time and resources for education and arts, which would have made them more sophisticated, meeting Belle's approval.

It is also possible that Gaston's intense desire to marry, which caused his nefarious plot, may be linked to levée en masse, a policy that required conscription for all unmarried French men between 18 and 25. So his patriarchal demands were a direct result of state policy to benefit the aristocracy by providing soldiers to sacrifice their lives in land disputes between inbred blue blood cousins.

Then, this exploitation provided a concentration of wealth and power in the city, which created the market for her father to pursue creative employment rather than farm work. This also forced them into slums, where squalor and poor public health systems lead to the spread of plague, which is met with cold indifference by the doctor, indicating lack of public health care as a source of Belle's childhood trauma. 

All of this exploitation and upward wealth transfer made its way back to the remote plantation of the Beast.

When confronted with this inescapable logic, what does she do? She decides to take the easy way out and enjoy the life of luxury, waited hand and foot by Beast's slaves, who feed her, clothe her, sing and dance for her. A life she always felt entitled to, on part of her feeling of superiority towards her provincial neighbors.

The moral of the story is, marry for money, and ignore the suffering of the poor. A terrible message for children."""

In [55]:
predict_review(input_text)

negative
